Self-Attention Mechanism-Based Head Pose Estimation Network with Fusion of Point Cloud and Image Features
Figure 1. The top row shows the RGB image, and the bottom row shows the head pose label corresponding to that image. (a) The discretized head pose at a yaw angle of 38.27°. (b) The discretized head pose at a yaw angle of −40.39°.
Figure 2. Head pose estimation network. Different colors represent the features of different stages. The network comprises four modules: the feature function, fusion function, score function, and predict function.
Figure 3. Feedforward residual MLP module.
Figure 4. Spatial self-attention module (a hedged sketch of such a block follows this caption list).
Figure 5. Feature extraction module. The network consists of five residual point blocks, three attention blocks, and a feature transform.
Figure 6. Classification and regression module.
Figure 7. Data source. (a) RGB image, showing the head position. (b) Head mask, in which the white region is the head and the black region is the background. (c) Point cloud, converted from the depth map.
Figure 8. Point clouds at different scales. (a) Original point cloud. (b) Original point cloud downsampled to 1024 points. (c) Original point cloud downsampled to 512 points. (A downsampling sketch also follows this caption list.)
Figure 9. Comparison of the model's prediction accuracy on the same dataset with and without positional encoding, for (a) yaw, (b) pitch, and (c) roll angles.
Figure 10. Comparison between the model's predictions and the ground truth when the 11th and 12th data sequences are used as the test set, for (a) yaw, (b) pitch, and (c) roll angles.
Figure 11. Mean absolute error between the model's predictions and the ground truth when each data sequence is used in turn as the test set. The red dotted line represents the upper limit of the model's prediction accuracy.
Figure 12. Comparison of different methods on the BIWI dataset in terms of (a) yaw, (b) pitch, and (c) roll angle prediction accuracy.
Figure 13. Visualization of partial test set results. Blue, green, and red denote yaw, pitch, and roll angles, respectively. The top and bottom rows show the RGB and point cloud visualizations, respectively.
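The captions for Figures 3 and 4 name the network's two recurring building blocks: a feedforward residual MLP and a spatial self-attention module. As a rough illustration only, and not the authors' implementation, here is a minimal PyTorch-style sketch of a residual self-attention block applied to per-point features; the class name, layer layout, and feature dimensions are all assumptions.

```python
import torch
import torch.nn as nn


class SpatialSelfAttention(nn.Module):
    """Minimal single-head self-attention over per-point features.

    Hypothetical sketch of the 'spatial self-attention module' of Figure 4;
    layer names and sizes are assumptions, not the paper's implementation.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_points, dim) fused point/image features
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return x + attn @ v  # residual connection, in the spirit of Figure 3


if __name__ == "__main__":
    feats = torch.randn(2, 1024, 128)      # 1024 points, 128-dim features
    out = SpatialSelfAttention(128)(feats)
    print(out.shape)                       # torch.Size([2, 1024, 128])
```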
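Figure 8's caption describes downsampling the original head point cloud to 1024 and 512 points. Farthest point sampling is a common choice for this step, but whether the paper uses it (rather than, say, random or voxel-grid sampling) is an assumption; the following NumPy sketch is illustrative only.

```python
import numpy as np


def farthest_point_sampling(points: np.ndarray, k: int) -> np.ndarray:
    """Downsample an (n, 3) point cloud to k points (cf. Figure 8).

    Greedy farthest point sampling: repeatedly pick the point farthest
    from everything selected so far, which spreads samples evenly.
    """
    n = points.shape[0]
    chosen = np.zeros(k, dtype=int)
    min_dist = np.full(n, np.inf)
    chosen[0] = np.random.randint(n)
    for i in range(1, k):
        # distances from all points to the most recently chosen point
        d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        min_dist = np.minimum(min_dist, d)
        chosen[i] = int(min_dist.argmax())
    return points[chosen]


cloud = np.random.rand(5000, 3)                      # stand-in head point cloud
print(farthest_point_sampling(cloud, 1024).shape)    # (1024, 3)
print(farthest_point_sampling(cloud, 512).shape)     # (512, 3)
```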
Abstract
1. Introduction
2. Discretization of Head Pose Labels
3. Head Pose Estimation Network
3.1. Feature Function Module
3.2. Fusion Function Module
3.3. Score and Prediction Function Module
4. Experimental Results
4.1. Data Source
4.2. Experimental Parameters
4.3. Experimental Results
4.3.1. Data Processing Result
4.3.2. Ablation Experiment Results
4.3.3. Model Evaluation and Comparison
5. Discussion
5.1. Comparison with Existing Methods
5.2. Future Research
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Rossi, S.; Leone, E.; Staffa, M. Using random forests for the estimation of multiple users' visual focus of attention from head pose. In Proceedings of AI*IA 2016 Advances in Artificial Intelligence: XVth International Conference of the Italian Association for Artificial Intelligence, Genova, Italy, 29 November–1 December 2016.
2. Huang, S.; Yang, K.; Xiao, H.; Han, P.; Qiu, J.; Peng, L.; Liu, D.; Luo, K. A new head pose tracking method based on stereo visual SLAM. J. Vis. Commun. Image Represent. 2022, 82, 103402.
3. Liu, H.; Liu, T.; Zhang, Z.; Sangaiah, A.K.; Yang, B.; Li, Y. ARHPE: Asymmetric relation-aware representation learning for head pose estimation in industrial human–computer interaction. IEEE Trans. Ind. Inf. 2022, 18, 7107–7117.
4. Avola, D.; Cinque, L.; Del Bimbo, A.; Marini, M.R. MIFTel: A multimodal interactive framework based on temporal logic rules. Multimed. Tools Appl. 2020, 79, 13533–13558.
5. Liu, H.; Nie, H.; Zhang, Z.; Li, Y.-F. Anisotropic angle distribution learning for head pose estimation and attention understanding in human–computer interaction. Neurocomputing 2021, 433, 310–322.
6. Wongphanngam, J.; Pumrin, S. Fatigue warning system for driver nodding off using depth image from Kinect. In Proceedings of the 2016 13th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, Chiang Mai, Thailand, 28 June–1 July 2016; pp. 1–6.
7. Baltrušaitis, T.; Robinson, P.; Morency, L.P. OpenFace: An open source facial behavior analysis toolkit. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision, Lake Placid, NY, USA, 7–10 March 2016; pp. 1–10.
8. Han, J.; Luo, K.; Qiu, J.; Liu, D.; Peng, L.; Han, P. Head attitude estimation method of eye tracker based on binocular camera. Adv. Laser Optoelectron. 2019, 58, 310–317.
9. Zhao, G.; Chen, L.; Song, J.; Chen, G. Large head movement tracking using SIFT-based registration. In Proceedings of the 15th ACM International Conference on Multimedia, Augsburg, Germany, 25–29 September 2007; pp. 807–810.
10. Liu, L.; Ke, Z.; Huo, J.; Chen, J. Head pose estimation through keypoints matching between reconstructed 3D face model and 2D image. Sensors 2021, 21, 1841.
11. Liu, H.; Zhang, C.; Deng, Y.; Liu, T.; Zhang, Z.; Li, Y.F. Orientation cues-aware facial relationship representation for head pose estimation via transformer. IEEE Trans. Image Process. 2023, 32, 6289–6302.
12. Geng, X.; Qian, X.; Huo, Z.; Zhang, Y. Head pose estimation based on multivariate label distribution. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1974–1991.
13. Zhang, C.; Liu, H.; Deng, Y.; Xie, B.; Li, Y. TokenHPE: Learning orientation tokens for efficient head pose estimation via transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 8897–8906.
14. Liu, H.; Fang, S.; Zhang, Z.; Li, D.; Lin, K.; Wang, J. MFDNet: Collaborative poses perception and matrix Fisher distribution for head pose estimation. IEEE Trans. Multimedia 2022, 24, 2449–2460.
15. Ruiz, N.; Chong, E.; Rehg, J.M. Fine-grained head pose estimation without keypoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2074–2083.
16. Yang, T.-Y.; Chen, Y.-T.; Lin, Y.-Y.; Chuang, Y.-Y. FSA-Net: Learning fine-grained structure aggregation for head pose estimation from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1087–1096.
17. Zhang, H.; Zhang, Y.; Geng, X. Practical age estimation using deep label distribution learning. Front. Comput. Sci. 2021, 15, 153318.
18. Liu, T.; Wang, J.; Yang, B.; Wang, X. NGDNet: Nonuniform Gaussian-label distribution learning for infrared head pose estimation and on-task behavior understanding in the classroom. Neurocomputing 2021, 436, 210–220.
19. Xu, L.; Chen, J.; Gan, Y. Head pose estimation with soft labels using regularized convolutional neural network. Neurocomputing 2017, 337, 339–353.
20. Chenglong, L.; Fan, Z.; Xin, M.; Xeuying, Q. Real-time head attitude estimation based on Kalman filter and random regression forest. J. Comput. Aid. Des. Graph. 2017, 29, 2309–2316.
21. Wang, Y.; Yuan, G.; Fu, X. Driver's head pose and gaze zone estimation based on multi-zone templates registration and multi-frame point cloud fusion. Sensors 2022, 22, 3154.
22. Shihua, X.; Nan, S.; Xupeng, W. 3D point cloud head attitude estimation based on deep learning. J. Comput. Appl. 2020, 40, 996–1001. (In Chinese)
23. Xu, Y.; Jung, C.; Chang, Y. Head pose estimation using deep neural networks and 3D point clouds. Pattern Recognit. 2022, 121, 108210.
24. Zhang, Y.; Fu, K.; Wang, J.; Cheng, P. Learning from discrete Gaussian label distribution and spatial channel-aware residual attention for head pose estimation. Neurocomputing 2020, 407, 259–269.
25. Gumbel, E.J. Les valeurs extrêmes des distributions statistiques. Ann. Inst. Henri Poincaré 1935, 5, 115–158.
26. Hornik, K.; Stinchcombe, M.; White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 1989, 2, 359–366.
27. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. NeRF: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106.
28. Charles, R.Q.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660.
29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
30. Chevtchenko, S.F.; Vale, R.F.; Macario, V.; Cordeiro, F.R. A convolutional neural network with feature fusion for real-time hand posture recognition. Appl. Soft Comput. 2018, 73, 748–766.
31. Zhou, W.; Dong, S.; Lei, J.; Yu, L. MTANet: Multitask-aware network with hierarchical multimodal fusion for RGB-T urban scene understanding. IEEE Trans. Intell. Veh. 2022, 8, 48–58.
32. Xu, D.; Anguelov, D.; Jain, A. PointFusion: Deep sensor fusion for 3D bounding box estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
33. Wang, C.; Xu, D.; Zhu, Y.; Martín-Martín, R.; Lu, C.; Fei-Fei, L.; Savarese, S. DenseFusion: 6D object pose estimation by iterative dense fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
34. Liu, H.; Wang, X.; Zhang, W.; Zhang, Z.; Li, Y.-F. Infrared head pose estimation with multi-scales feature fusion on the IRHP database for human attention recognition. Neurocomputing 2020, 411, 510–520.
35. Fanelli, G.; Gall, J.; Van Gool, L. Real time head pose estimation with random regression forests. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 20–25 June 2011.
36. Xu, X.; Kakadiaris, I.A. Joint head pose estimation and face alignment framework using global and local CNN features. In Proceedings of the 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), Washington, DC, USA, 30 May–3 June 2017.
37. Wang, Y.; Liang, W.; Shen, J.; Jia, Y.; Yu, L.-F. A deep coarse-to-fine network for head pose estimation from synthetic data. Pattern Recognit. 2019, 94, 196–206.
38. Borghi, G.; Fabbri, M.; Vezzani, R.; Calderara, S.; Cucchiara, R. Face-from-depth for head pose estimation on depth images. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 596–609.
39. Meyer, G.P.; Gupta, S.; Frosio, I.; Reddy, D.; Kautz, J. Robust model-based 3D head pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015.
| | Yaw (°) | Pitch (°) | Roll (°) | Mean (°) |
|---|---|---|---|---|
| 0.10 | 1.67 | 1.85 | 2.26 | 1.92 |
| 1.00 | 1.38 | 1.34 | 1.76 | 1.49 |
| 2.00 | 1.36 | 1.43 | 1.85 | 1.54 |
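The table above compares mean absolute errors for three settings of a parameter whose column header was lost in extraction; the values behave like a spread parameter of the label distribution, with 1.00 giving the lowest mean error. The "Our (Gauss)" and "Our (Gumbel)" rows in the final comparison table, together with the Gumbel (1935) reference [25] and the "Discretization of Head Pose Labels" section, suggest that continuous pose angles are spread over discrete bins as soft label distributions. A minimal sketch under that assumption follows; the bin range, bin width, and normalization are all hypothetical choices, not taken from the paper.

```python
import numpy as np


def soft_labels(angle: float, bins: np.ndarray, sigma: float = 1.0,
                dist: str = "gauss") -> np.ndarray:
    """Spread a continuous angle over discrete bins as a soft label.

    Hypothetical sketch of Gaussian vs. Gumbel label discretization;
    the paper's exact binning and normalization are assumptions.
    """
    d = bins - angle
    if dist == "gauss":
        w = np.exp(-0.5 * (d / sigma) ** 2)        # Gaussian kernel
    else:
        z = d / sigma
        w = np.exp(-(z + np.exp(-z)))              # standard Gumbel density
    return w / w.sum()                             # normalize to a distribution


bins = np.arange(-90, 93, 3, dtype=float)   # e.g. 3° bins over [-90°, 90°]
p = soft_labels(38.27, bins, sigma=1.0)     # yaw angle from Figure 1a
print(bins[p.argmax()], p.max())            # peak bin near 39°
```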
| Method | Yaw (°) | Pitch (°) | Roll (°) | Mean (°) |
|---|---|---|---|---|
| S-A + P + R | 1.38 | 1.34 | 1.76 | 1.49 |
| S-A + P | 1.62 | 2.05 | 2.07 | 1.91 |
| P | 1.44 | 2.59 | 2.17 | 2.06 |
| Method | Data | Yaw (°) | Pitch (°) | Roll (°) | Mean (°) |
|---|---|---|---|---|---|
| FineGrained [15] | RGB | 3.29 | 3.39 | 3.30 | 3.23 |
| FSANet [16] | RGB | 4.96 | 4.27 | 2.76 | 4.00 |
| PGCNN [23] | Point | 1.82 | 1.09 | 1.39 | 1.42 |
| Multi-task [36] | RGB | 4.30 | 3.60 | 3.40 | 3.76 |
| CoarseFine [37] | RGB | 4.76 | 5.48 | 4.29 | 4.84 |
| POSEidon [38] | Depth | 1.70 | 1.60 | 1.80 | 1.70 |
| RobustMode [39] | Depth | 2.40 | 2.20 | 2.10 | 2.10 |
| Ours (Gauss) | RGB + Point | 1.12 | 0.94 | 0.74 | 0.93 |
| Ours (Gumbel) | RGB + Point | 1.18 | 0.67 | 0.68 | 0.84 |
Chen, K.; Wu, Z.; Huang, J.; Su, Y. Self-Attention Mechanism-Based Head Pose Estimation Network with Fusion of Point Cloud and Image Features. Sensors 2023, 23, 9894. https://doi.org/10.3390/s23249894