Estimating a 3D Human Skeleton from a Single RGB Image by Fusing Predicted Depths from Multiple Virtual Viewpoints †
Figure 1. (a) Multi-view geometry; (b) our setup with multiple virtual viewpoints (the blue camera is real, the other N (here, N = 7) cameras are virtual, and the two green cameras are those selected after experiments (Section 4.2.1)); (c) geometry for depth error analysis.
Figure 2. (a) Overall architecture of our proposed two-stream method; (b) detailed architecture of the first-stage network, including the "real" stream (Real-Net) and the virtual stream (Virtual-Net); (c) detailed architecture of the fusion module (FM) in the second stage. N denotes the number of virtual viewpoints, J the number of joints, and D the dimension of the embeddings.
Figure 3. Global context information of humans (P1–P3) with the same 3D pose, captured by the camera from different viewpoints (horizontal viewing angles of −α, 0, and β, respectively).
Figure 4. Architecture of the fusion network in the fusion module, where N is the total number of virtual viewpoints: (a) DenseFC network; (b) GCN.
Figure 5. Illustration of the bone vector connections in our system.
Figure 6. (a) Error distribution across different actions, where the dotted red line marks the overall MPJPE of 45.7 mm; (b) average MPJPE of each joint.
Figure 7. Visualized results on the Human3.6M dataset: (a) successful predictions; (b) failed predictions on some joints.
Figure 8. Qualitative results in in-the-wild scenarios: (a) successful cases; (b) failed cases.
Abstract
1. Introduction
- We propose a 3D human skeleton estimation method that requires only a single monocular RGB image as input, making our system practical for a wide range of applications.
- We propose a two-stream method that predicts enhanced 2D skeletons for the real and virtual viewpoints; the outputs are then processed and fused via a cropped-to-original coordinate transform (COCT) module, a depth-denoising (DD) module, and a fusion module (FM) to regress the final 3D human skeleton (a minimal data-flow sketch is given after this list).
- Our proposed method outperforms single-image-based methods when evaluated on the Human3.6M dataset [12] and achieves a performance comparable to that of state-of-the-art (SOTA) methods based on a long image sequence.
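To make the overall structure of these contributions easier to follow, the minimal sketch below traces the implied data flow. It is illustrative only: the learned Real-Net, Virtual-Net, and fusion module are replaced by random stand-ins, and all tensor shapes and function names are our assumptions rather than the paper's exact configuration.

```python
import numpy as np

# Illustrative data flow only: Real-Net, Virtual-Net, and the fusion module (FM)
# are learned networks in the paper (Figure 2); here they are random stand-ins.
J, N = 17, 2                      # joints; selected virtual viewpoints (v_left90, v_right90)

def real_net(crop):               # real stream: 2D joints + per-joint depth for the real view
    return np.random.rand(J, 3)

def virtual_net(crop):            # virtual stream: one (2D + depth) skeleton per virtual view
    return np.random.rand(N, J, 3)

def fusion_module(real_kpts, virtual_kpts):
    # FM: embed each viewpoint, fuse, and regress the final 3D skeleton (J x 3).
    stacked = np.concatenate([real_kpts[None], virtual_kpts], axis=0)   # (N + 1, J, 3)
    return stacked.mean(axis=0)   # placeholder for the learned embedding + fusion networks

crop = np.zeros((256, 256, 3))    # person crop from an off-the-shelf detector
real_kpts = real_net(crop)
virt_kpts = virtual_net(crop)
# COCT and depth denoising (Sections 3.2 and 3.3) would be applied here before fusion.
skeleton_3d = fusion_module(real_kpts, virt_kpts)
print(skeleton_3d.shape)          # (17, 3)
```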
2. Related Work
2.1. Single-View Methods
2.2. Multi-View Methods
3. Proposed Method
3.1. Stage 1: Real-Net and Virtual-Net
3.2. Stage 2-1: Cropped-to-Original Coordinate Transform (COCT)
3.3. Stage 2-2: Depth Denoising (DD) Module
3.4. Stage 2-3: Fusion Module (FM)
3.5. Data Preprocessing and Virtual-Viewpoint Skeleton Generation
3.6. Loss Functions
4. Experimental Results
4.1. Experimental Settings
4.2. Ablation Study
4.2.1. Number of Virtual Viewpoints
4.2.2. Level of Depth Denoising
4.2.3. Availability of the COCT Module
4.2.4. Embedding Network and Fusion Network
4.2.5. Impact of Each Component in Our Proposed Method
4.3. Performance Comparison with State-of-the-Art (SOTA) Methods
4.4. Error and Cost Analysis
4.5. Visualized Results and Real Tests
5. Discussions and Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Zhang, P.; Lan, C.; Xing, J.; Zeng, W.; Xue, J.; Zheng, N. View Adaptive Neural Networks for High Performance Skeleton-Based Human Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 1963–1978. [Google Scholar] [CrossRef]
- Boekhoudt, K.; Matei, A.; Aghaei, M.; Talavera, E. HR-Crime: Human-Related Anomaly Detection in Surveillance Videos. In Proceedings of the International Conference on Computer Analysis of Images and Patterns, Virtual Event, 28–30 September 2021; pp. 164–174. [Google Scholar]
- Chiang, J.C.; Lie, W.N.; Huang, H.C.; Chen, K.T.; Liang, J.Y.; Lo, Y.C.; Huang, W.H. Posture Monitoring for Health Care of Bedridden Elderly Patients Using 3D Human Skeleton Analysis via Machine Learning Approach. Appl. Sci. 2022, 12, 3087. [Google Scholar] [CrossRef]
- Peppas, K.; Tsiolis, K.; Mariolis, I.; Topalidou-Kyniazopoulou, A.; Tzovaras, D. Multi-modal 3D Human Pose Estimation for Human-Robot Collaborative Applications. In Proceedings of the Structural, Syntactic, and Statistical Pattern Recognition: Joint IAPR International Workshops, Padua, Italy, 21–22 January 2021; pp. 355–364. [Google Scholar]
- Sharma, S.; Varigonda, P.T.; Bindal, P.; Sharma, A.; Jain, A. Monocular 3D Human Pose Estimation by Generation and Ordinal Ranking. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2325–2334. [Google Scholar]
- Zhao, L.; Peng, X.; Tian, Y.; Kapadia, M.; Metaxas, D.N. Semantic Graph Convolutional Networks for 3D Human Pose Regression. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3420–3430. [Google Scholar]
- Cheng, Y.; Yang, B.; Wang, B.; Tan, R.T. 3D Human Pose Estimation Using Spatio-Temporal Networks with Explicit Occlusion Training. In Proceedings of the 2020 AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 10631–10638. [Google Scholar]
- Lie, W.N.; Yang, P.H.; Vann, V.; Chiang, J.C. 3D Human Skeleton Estimation Based on RGB Image Sequence and Graph Convolution Network. In Proceedings of the 2022 IEEE 24th International Workshop on Multimedia Signal Processing (MMSP), Shanghai, China, 26–28 September 2022; pp. 1–6. [Google Scholar]
- Qiu, H.; Wang, C.; Wang, J.; Wang, N.; Zeng, W. Cross View Fusion for 3D Human Pose Estimation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4341–4350. [Google Scholar]
- Iskakov, K.; Burkov, E.; Lempitsky, V.; Malkov, Y. Learnable Triangulation of Human Pose. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7718–7727. [Google Scholar]
- Chun, J.; Park, S.; Ji, M. 3D Human Pose Estimation from RGB-D Images Using Deep Learning Method. In Proceedings of the 2018 International Conference on Sensors, Signal and Image Processing (SSIP), Prague, Czech Republic, 12–14 October 2018; pp. 51–55. [Google Scholar]
- Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1325–1339. [Google Scholar] [CrossRef] [PubMed]
- Wu, H.; Xiao, B. 3D Human Pose Estimation via Explicit Compositional Depth Maps. In Proceedings of the 2020 AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12378–12385. [Google Scholar]
- Zhou, F.; Yin, J.; Li, P. Lifting by Image–Leveraging Image Cues for Accurate 3D Human Pose Estimation. In Proceedings of the 2024 AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 7632–7640. [Google Scholar]
- Kang, Y.; Liu, Y.; Yao, A.; Wang, S.; Wu, E. 3D Human Pose Lifting with Grid Convolution. In Proceedings of the 2023 AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 1105–1113. [Google Scholar]
- Xu, T.; Takano, W. Graph Stacked Hourglass Networks for 3D Human Pose Estimation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 16105–16114. [Google Scholar]
- Li, H.; Pun, C.M. CEE-Net: Complementary End-to-End Network for 3D Human Pose Generation and Estimation. In Proceedings of the 2023 AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 1305–1313. [Google Scholar]
- Gong, K.; Zhang, J.; Feng, J. PoseAug: A Differentiable Pose Augmentation Framework for 3D Human Pose Estimation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 8575–8584. [Google Scholar]
- Bai, Y.; Wang, L.; Tao, Z.; Li, S.; Fu, Y. Correlative Channel-Aware Fusion for Multi-View Time Series Classification. In Proceedings of the 2021 AAAI Conference on Artificial Intelligence, Virtual Event, 2–9 February 2021; pp. 6714–6722. [Google Scholar]
- Kim, H.W.; Lee, G.H.; Oh, M.S.; Lee, S.W. Cross-View Self-Fusion for Self-Supervised 3D Human Pose Estimation in the Wild. In Proceedings of the 2022 Asian Conference on Computer Vision (ACCV), Macau, China, 4–8 December 2022; pp. 1385–1402. [Google Scholar]
- Hua, G.; Liu, H.; Li, W.; Zhang, Q.; Ding, R.; Xu, X. Weakly-supervised 3D Human Pose Estimation with Cross-view U-shaped Graph Convolutional Network. IEEE Trans. Multimed. 2022, 25, 1832–1843. [Google Scholar] [CrossRef]
- Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364. [Google Scholar] [CrossRef]
- Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded Pyramid Network for Multi-Person Pose Estimation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7103–7112. [Google Scholar]
- Lie, W.N.; Vann, V. 3D Human Skeleton Estimation from Single RGB Image Based on Fusion of Predicted Depths from Multiple Virtual-Viewpoints. In Proceedings of the 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Taipei, Taiwan, 31 October–3 November 2023; pp. 719–725. [Google Scholar]
- Sun, X.; Xiao, B.; Wei, F.; Liang, S.; Wei, Y. Integral Human Pose Regression. In Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning (ICML), Helsinki, Finland, 5–9 July 2008; pp. 1096–1103. [Google Scholar]
- Zou, Z.; Tang, W. Modulated Graph Convolutional Network for 3D Human Pose Estimation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 11477–11487. [Google Scholar]
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar]
- Andriluka, M.; Pishchulin, L.; Gehler, P.; Schiele, B. 2D Human Pose Estimation: New Benchmark and State of the Art Analysis. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 3686–3693. [Google Scholar]
- Cai, J.; Liu, H.; Ding, R.; Li, W.; Wu, J.; Ban, M. HTNet: Human Topology Aware Network for 3D Human Pose Estimation. In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
- Wu, Y.; Ma, S.; Zhang, D.; Huang, W.; Chen, Y. An Improved Mixture Density Network for 3D Human Pose Estimation with Ordinal Ranking. Sensors 2022, 22, 4987. [Google Scholar] [CrossRef]
- Zheng, C.; Zhu, S.; Mendieta, M.; Yang, T.; Chen, C.; Ding, Z. 3D Human Pose Estimation with Spatial and Temporal Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 11656–11665. [Google Scholar]
- Chen, T.; Fang, C.; Shen, X.; Zhu, Y.; Chen, Z.; Luo, J. Anatomy-Aware 3D Human Pose Estimation with Bone-Based Pose Decomposition. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 198–209. [Google Scholar] [CrossRef]
- Li, W.; Du, R.; Chen, S. Skeleton-Based Spatio-Temporal U-Network for 3D Human Pose Estimation in Video. Sensors 2022, 22, 2573. [Google Scholar] [CrossRef] [PubMed]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
- Li, W.; Liu, H.; Tang, H.; Wang, P.; Van Gool, L. MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 13147–13156. [Google Scholar]
- Lee, D.; Kim, J. HDPose: Post-Hierarchical Diffusion with Conditioning for 3D Human Pose Estimation. Sensors 2024, 24, 829. [Google Scholar] [CrossRef]
- Wang, H.; Quan, W.; Zhao, R.; Zhang, M.; Jiang, N. Learning Temporal–Spatial Contextual Adaptation for Three-Dimensional Human Pose Estimation. Sensors 2024, 24, 4422. [Google Scholar] [CrossRef] [PubMed]
- Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 2021, 43, 172–186. [Google Scholar] [CrossRef] [PubMed]
- Li, J.; Su, W.; Wang, Z. SimplePose: Rethinking and Improving a Bottom-Up Approach for Multi-Person Pose Estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11354–11361. [Google Scholar]
Number of Virtual Viewpoints (N) | MPJPE (mm) | PA-MPJPE (mm) |
---|---|---|
0 (real viewpoint only) | 50.6 | 38.1 |
1 | 49.2 | 37.5 |
2 | 49.2 | 37.6 |
3 | 49.2 | 37.7 |
4 | 49.3 | 37.8 |
5 | 49.3 | 37.6 |
6 | 49.3 | 37.6 |
7 | 49.2 | 37.7 |
8 | 50.0 | 38.3 |
9 | 49.3 | 37.7 |
10 | 49.3 | 37.7 |
2 (v_left90, v_right90) | 49.0 | 37.5 |
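The last row of the table above indicates that two virtual viewpoints placed at roughly ±90° of yaw (v_left90, v_right90) work best. As an illustration of how virtual-viewpoint skeletons might be synthesized for training (Section 3.5), the sketch below rotates a camera-space 3D skeleton about a vertical axis through a pivot joint, which is geometrically equivalent to placing a virtual camera at that yaw angle around the subject; the paper's exact generation procedure, coordinate conventions, and the function name here are assumptions.

```python
import numpy as np

def rotate_about_subject(skeleton_cam, yaw_deg, pivot_joint=0):
    """Hypothetical virtual-viewpoint synthesis: rotate a camera-space 3D skeleton
    (J x 3, in mm) about a vertical (y) axis through a pivot joint (e.g., the root)."""
    theta = np.deg2rad(yaw_deg)
    rot_y = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
                      [ 0.0,           1.0, 0.0          ],
                      [-np.sin(theta), 0.0, np.cos(theta)]])
    pivot = skeleton_cam[pivot_joint]
    return (skeleton_cam - pivot) @ rot_y.T + pivot

skeleton = np.random.rand(17, 3) * 1000.0          # dummy camera-space skeleton in mm
left90 = rotate_about_subject(skeleton, +90.0)     # cf. v_left90 in the table
right90 = rotate_about_subject(skeleton, -90.0)    # cf. v_right90
```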
Denoising Level #1 | Denoising Level #2 | MPJPE (mm) | PA-MPJPE (mm) |
---|---|---|---|
0 (none) | 0 (none) | 49.00 | 37.5 |
0.010 | 0 | 47.90 | 37.0 |
0.02 | 0 | 47.57 | 37.1 |
0.022 | 0 | 47.52 | 37.2 |
0.025 | 0 | 47.61 | 37.4 |
0.030 | 0 | 47.93 | 37.6 |
0.040 | 0 | 48.81 | 38.7 |
0.022 | 0.001 | 47.60 | 37.2 |
0.022 | 0.005 | 47.50 | 37.0 |
0.022 | 0.010 | 47.48 | 37.5 |
0.022 | 0.015 | 47.78 | 37.3 |
0.022 | 0.020 | 47.81 | 37.1 |
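The table above sweeps the level of depth denoising used by the DD module. In the spirit of denoising autoencoders (Vincent et al. [27]), a minimal sketch of building a (noisy, clean) training pair is given below; whether the tabulated values are exactly such Gaussian standard deviations, and which quantities the two columns apply to, is our assumption.

```python
import numpy as np

def make_dd_training_pair(clean_depths, sigma=0.022, rng=None):
    """Denoising-autoencoder-style pair: corrupt clean per-joint depths with additive
    Gaussian noise of standard deviation sigma, so the DD network can learn to map
    the noisy input back to the clean target. The paper's exact corruption model and
    depth normalization may differ."""
    if rng is None:
        rng = np.random.default_rng(0)
    noisy = clean_depths + rng.normal(0.0, sigma, size=clean_depths.shape)
    return noisy, clean_depths          # (network input, regression target)

depths = np.random.rand(17)             # dummy normalized joint depths for one skeleton
noisy, target = make_dd_training_pair(depths, sigma=0.022)
```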
Human’s Position | Distance (pixel) | MPJPE (mm) (w/o COCT) | MPJPE (mm) (with COCT) | Improvement (mm) |
---|---|---|---|---|
Small (34.17%) | 0–99 | 51.01 | 50.14 | 0.87 |
Medium (61.74%) | 100–249 | 45.82 | 44.95 | 0.87 |
Large (4.09%) | 250–375 | 43.03 | 41.57 | 1.46 |
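A plausible form of the cropped-to-original coordinate transform (COCT) evaluated above is sketched below: 2D joints estimated in the person crop are mapped back to original-image coordinates using the detection bounding box, restoring the global-position cue illustrated in Figure 3. The bounding-box convention, crop size, and function name are assumptions, not the paper's exact formulation.

```python
import numpy as np

def crop_to_original(kpts_crop, bbox, crop_size=(256, 256)):
    """Undo the crop-and-resize so 2D joints are expressed in original-image pixels.
    bbox = (x0, y0, w, h) of the person detection; kpts_crop is (J, 2) in crop pixels."""
    x0, y0, w, h = bbox
    scale = np.array([w / crop_size[0], h / crop_size[1]])
    return kpts_crop * scale + np.array([x0, y0])

kpts_in_crop = np.random.rand(17, 2) * 256.0
kpts_in_image = crop_to_original(kpts_in_crop, bbox=(320.0, 100.0, 180.0, 420.0))
```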
Embedding Network | Fusion Network | MPJPE (mm) | PA-MPJPE (mm) | Model Size (MB) |
---|---|---|---|---|
MLP | DenseFC | 45.78 | 36.61 | 590.6 |
MLP | MLP | 46.82 | 37.07 | 360.0 |
MLP | GCN | 47.60 | 37.87 | 219.9 |
GCN | DenseFC | 46.12 | 36.61 | 313.1 |
GCN | MLP | 46.77 | 37.23 | 309.4 |
GCN | GCN | 46.43 | 37.18 | 1.5 |
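To make the "embedding network + fusion network" combinations in the table concrete, below is an illustrative PyTorch sketch of a per-viewpoint MLP embedding followed by a DenseFC-style fusion head with dense (concatenative) skip connections (Figure 4a). Layer counts and widths are placeholders and do not correspond to the configurations behind the reported model sizes.

```python
import torch
import torch.nn as nn

class MLPEmbed(nn.Module):
    """Per-viewpoint embedding: flatten a (J, 3) skeleton into a D-dimensional vector."""
    def __init__(self, num_joints=17, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(num_joints * 3, dim), nn.ReLU(),
                                 nn.Linear(dim, dim), nn.ReLU())

    def forward(self, skel):                     # skel: (B, J, 3)
        return self.net(skel.flatten(1))         # (B, D)

class DenseFCFusion(nn.Module):
    """Fusion head in the spirit of DenseFC: fully connected layers with dense
    (concatenative) skip connections that regress the final J x 3 skeleton."""
    def __init__(self, num_views=3, num_joints=17, dim=256, hidden=512):
        super().__init__()
        self.num_joints = num_joints
        in_dim = num_views * dim
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(in_dim + hidden, hidden)
        self.out = nn.Linear(in_dim + 2 * hidden, num_joints * 3)

    def forward(self, embeds):                   # embeds: list of (B, D), one per viewpoint
        x0 = torch.cat(embeds, dim=1)
        x1 = torch.relu(self.fc1(x0))
        x2 = torch.relu(self.fc2(torch.cat([x0, x1], dim=1)))
        out = self.out(torch.cat([x0, x1, x2], dim=1))
        return out.view(-1, self.num_joints, 3)

embed, fuse = MLPEmbed(), DenseFCFusion(num_views=3)
views = [torch.rand(8, 17, 3) for _ in range(3)]      # real + two virtual viewpoints
pred_3d = fuse([embed(v) for v in views])             # (8, 17, 3)
```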
Real Viewpoint | Virtual Viewpoints | Depth Denoising | COCT | Embedding Network | Fusion Network | MPJPE (mm) | PA-MPJPE (mm) |
---|---|---|---|---|---|---|---|
✓ | ✓ | 50.6 | 38.1 | ||||
✓ | ✓ | ✓ | 49.0 | 37.5 | |||
✓ | ✓ | ✓ | ✓ | 47.5 | 37.5 | ||
✓ | ✓ | ✓ | ✓ | ✓ | 46.6 | 36.9 | |
✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 45.7 | 36.6 |
Protocol #1 (MPJPE, mm) | Dir. | Disc. | Eat | Greet | Phone | Photo | Pose | Purch. | Sit | SitD. | Smoke | Wait | WalkD | Walk | WalkT | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CEE-Net [19] (T = 1) | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 47.3 |
Zou et al. [29] (T = 1) | 45.4 | 49.2 | 45.7 | 49.4 | 50.4 | 58.2 | 47.9 | 46.0 | 57.5 | 63.0 | 49.7 | 46.6 | 52.2 | 38.9 | 40.8 | 49.4 |
Lifting by Image [16] (T = 1) | 44.9 | 46.4 | 42.4 | 44.9 | 48.7 | 40.1 | 44.3 | 55.0 | 58.9 | 47.1 | 48.2 | 42.6 | 36.9 | 48.8 | 40.1 | 46.4 |
LCMDN [34] (T = 1) | 42.0 | 47.1 | 44.5 | 48.2 | 54.5 | 58.1 | 44.0 | 45.8 | 57.9 | 71.4 | 52.0 | 48.7 | 52.7 | 41.3 | 42.3 | 50.0 |
HTNet [33] (T = 1) | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 47.6 |
HTNet [33] (T = 27) | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 46.1 |
MHFormer [35] (T = 351) | 39.2 | 43.1 | 40.1 | 40.9 | 44.9 | 51.2 | 40.6 | 41.3 | 53.5 | 60.3 | 43.7 | 41.1 | 43.8 | 29.8 | 30.6 | 43.0 |
PoseFormer [36] (T = 81) | 41.5 | 44.8 | 39.8 | 42.5 | 46.5 | 51.6 | 42.1 | 42.0 | 53.3 | 60.7 | 45.5 | 43.3 | 46.1 | 31.8 | 32.2 | 44.3 |
PoseFormer [36] (T = 27) | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 47.0 |
Chen et al. [37] (T = 243) | 41.4 | 43.5 | 40.1 | 42.9 | 46.6 | 51.9 | 41.7 | 42.3 | 53.9 | 60.2 | 45.4 | 41.7 | 46.0 | 31.5 | 32.7 | 44.1 |
Chen et al. [37] (T = 9) | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 46.3 |
Lie et al. [8] (T = 31) | 40.8 | 46.0 | 41.3 | 57.1 | 47.0 | 52.8 | 39.9 | 42.3 | 55.2 | 72.4 | 44.7 | 53.3 | 47.7 | 33.3 | 34.9 | 47.3 |
HDPose [38] (T = 243) | 37.8 | 40.7 | 37.7 | 39.6 | 42.4 | 50.2 | 39.8 | 40.2 | 51.8 | 55.8 | 42.2 | 39.8 | 41.0 | 27.9 | 28.1 | 41.0 |
DASTFormer [39] (T = 243) | 36.8 | 39.7 | 39.3 | 34.3 | 40.9 | 50.6 | 36.8 | 36.7 | 50.9 | 59.0 | 41.4 | 38.4 | 37.9 | 25.3 | 25.8 | 39.6 |
STUNet [40] (T = 27) | 43.5 | 44.8 | 43.9 | 44.1 | 47.7 | 56.5 | 44.0 | 44.2 | 55.8 | 67.9 | 47.3 | 46.5 | 45.7 | 33.4 | 33.6 | 46.6 |
Ours (T = 1) | 36.3 | 42.8 | 40.2 | 57.5 | 44.7 | 48.2 | 37.2 | 38.9 | 53.9 | 74.1 | 42.7 | 55.0 | 43.7 | 32.7 | 34.1 | 45.7 |
Protocol #2 (PA-MPJPE, mm) | Dir. | Disc. | Eat | Greet | Phone | Photo | Pose | Purch. | Sit | SitD. | Smoke | Wait | WalkD | Walk | WalkT | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CEE-Net [19] (T = 1) | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 36.8 |
Zou et al. [29] (T = 1) | 35.7 | 38.6 | 36.3 | 40.5 | 39.2 | 44.5 | 37.0 | 35.4 | 46.4 | 51.2 | 40.5 | 35.6 | 41.7 | 30.7 | 33.9 | 39.1 |
HTNet [33] (T = 1) | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 38.6 |
MHFormer [35] (T = 351) | 31.5 | 34.9 | 32.8 | 33.6 | 35.3 | 39.6 | 32.0 | 32.2 | 43.5 | 48.7 | 36.4 | 32.6 | 34.3 | 23.9 | 25.1 | 34.4 |
PoseFormer [36] (T = 81) | 32.5 | 34.8 | 32.6 | 34.6 | 35.3 | 39.5 | 32.1 | 32.0 | 42.8 | 48.5 | 34.8 | 32.4 | 35.3 | 24.5 | 26.0 | 34.6 |
Chen et al. [37] (T = 243) | 32.6 | 35.1 | 32.8 | 35.4 | 36.3 | 40.4 | 32.4 | 32.3 | 42.7 | 49.0 | 36.8 | 32.4 | 36.0 | 24.9 | 26.5 | 35.0 |
Lie et al. [8] (T = 31) | 31.6 | 35.1 | 34.2 | 38.6 | 37.3 | 38.9 | 31.3 | 32.9 | 45.2 | 51.1 | 36.3 | 37.0 | 35.8 | 24.8 | 26.9 | 35.8 |
HDPose [38] (T = 243) | 31.0 | 33.2 | 30.6 | 31.9 | 33.2 | 39.2 | 31.1 | 30.7 | 42.5 | 45.0 | 34.1 | 30.7 | 32.5 | 22.0 | 23.0 | 32.8 |
DASTFormer [39] (T = 243) | 31.1 | 33.7 | 33.8 | 29.4 | 34.0 | 39.6 | 30.3 | 31.4 | 43.5 | 49.7 | 36.0 | 31.3 | 32.8 | 22.0 | 22.6 | 33.4 |
STUNet [40] (T = 27) | 34.3 | 35.7 | 34.9 | 36.6 | 37.5 | 42.7 | 33.1 | 36.0 | 44.4 | 53.7 | 38.5 | 33.5 | 38.4 | 26.0 | 28.4 | 36.9 |
Ours (T = 1) | 31.0 | 34.8 | 34.8 | 39.9 | 37.6 | 38.8 | 31.1 | 32.0 | 46.4 | 52.9 | 37.1 | 38.3 | 36.2 | 27.0 | 29.8 | 36.6 |
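Both comparison tables use the standard Human3.6M evaluation metrics: Protocol #1 reports MPJPE, and Protocol #2 reports PA-MPJPE, i.e., MPJPE after a rigid Procrustes (similarity) alignment of the prediction to the ground truth. A small NumPy sketch of both metrics follows; it is a reference implementation of the standard definitions, not code from the paper.

```python
import numpy as np

def mpjpe(pred, gt):
    """Protocol #1: mean per-joint position error in mm (pred, gt: (J, 3) arrays,
    conventionally root-aligned before evaluation)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """Protocol #2: MPJPE after Procrustes alignment (scale, rotation, translation)
    of the prediction to the ground truth."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(P.T @ G)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T                              # optimal rotation mapping P to G
    s = (S * np.diag(D)).sum() / (P ** 2).sum()     # optimal isotropic scale
    aligned = s * P @ R.T + mu_g
    return mpjpe(aligned, gt)

pred = np.random.rand(17, 3) * 1000.0
gt = np.random.rand(17, 3) * 1000.0
print(mpjpe(pred, gt), pa_mpjpe(pred, gt))
```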
Model | Model Size (MB) | GFLOPs | MPJPE (mm) |
---|---|---|---|
Chen et al. [37] (T = 9) | 531 (CPN) + 903 | NA | 46.3 |
HTNet [33] (T = 27) | 531 (CPN) + 11.6 | NA | 46.1 |
MHFormer [35] (T = 351) | 531 (CPN) + 120 | NA | 43.0 |
Ours | 405.9 (Real-Net) + 528.1 (Virtual-Net) + 590.6 (FM, MLP + DenseFC) | 18.17 | 45.7 |
Ours | 405.9 (Real-Net) + 528.1 (Virtual-Net) + 1.5 (FM, GCN + GCN) | 18.13 | 46.4 |