Abstract
Hand segmentation is an integral part of many computer vision applications, especially gesture recognition. One of the most popular approaches is to train a classifier that labels pixels as hand or background using skin color as a feature. This approach has largely been restricted to simple hand segmentation scenarios, since the color feature alone provides very limited information for classification. Meanwhile, there has been a rise of segmentation methods that use deep neural networks to exploit multiple layers of complex features learned from image data. Yet a deep neural network requires a large database for training and, owing to its computational complexity, a powerful machine for operation. In this work, the development of comprehensive features, and their optimized use with a randomized decision forest (RDF) classifier, is investigated for the task of hand segmentation in uncontrolled indoor environments. Newly designed image features and new implementations are presented together with evaluations of their hand segmentation performance. In total, seven image features that extract pixel- or neighborhood-related properties from color images are proposed and evaluated, both individually and in combination. The behaviours of the feature and RDF parameters are also evaluated, and optimum parameters for the scenario under consideration are identified. Additionally, a new dataset containing hand images in uncontrolled indoor scenarios was created during this work. The research shows that a combination of features extracting color, texture, neighborhood histogram, and neighborhood probability information outperforms existing methods for hand segmentation in restricted as well as unrestricted indoor environments, using only a small training dataset. The computations required for the proposed features and the RDF classifier are light, so the segmentation algorithm is suited to embedded devices with limited power, memory, and computational capacity.
Appendices
Appendix A: Features
A.1 Comparison of the new implementation for estimating texture using Gabor filters
Figure 20 compares the filter responses of the new implementation and the original method for one of the selected scales and orientations. The error in this case was 0.02% in the mean-square sense.
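The appendix does not reproduce the implementation itself, so the following is a minimal Python/OpenCV sketch of the idea suggested in B.1 by the "scale down operation": instead of filtering the full-resolution image with a large-scale Gabor kernel, the image is downscaled and filtered with a proportionally smaller kernel, and the response is resized back. All kernel parameters, the file name, and the relative-MSE metric are illustrative assumptions, not the paper's exact values.

```python
import cv2
import numpy as np

def gabor_response_direct(img, ksize, sigma, theta, lambd):
    """Reference: filter at full resolution with a large kernel."""
    kernel = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma=0.5)
    return cv2.filter2D(img, cv2.CV_32F, kernel)

def gabor_response_pyramid(img, ksize, sigma, theta, lambd, level):
    """Approximation: downscale the image 'level' times, filter with a
    small base kernel, then resize the response to full resolution."""
    small = img.copy()
    for _ in range(level):
        small = cv2.pyrDown(small)
    kernel = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma=0.5)
    resp = cv2.filter2D(small, cv2.CV_32F, kernel)
    return cv2.resize(resp, (img.shape[1], img.shape[0]))

# Hypothetical input image; parameters scale by 2**level between the two calls
img = cv2.imread("hand.png", cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0
ref = gabor_response_direct(img, ksize=31, sigma=8.0, theta=0.0, lambd=16.0)
fast = gabor_response_pyramid(img, ksize=9, sigma=2.0, theta=0.0, lambd=4.0, level=2)

# Relative mean-square error between the two responses
mse = np.mean((ref - fast) ** 2) / np.mean(ref ** 2)
print(f"relative MSE: {100 * mse:.3f}%")
```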
Appendix B: Optimization of feature parameters
B.1 Gabor texture feature
As explained in Section 3.2, there are two parameters associated with the Gabor texture feature: the number of scales (nScale) and the number of orientations (nOrient). Figure 21 shows how nScale affects segmentation performance and feature extraction time. The maximum possible value of nScale is 5, limited by the image dimensions and the scale-down operation used for the faster implementation. It can be observed from the figure that precision increased with the number of scales for both scenarios, whereas the behaviour of recall differed between the databases. The time required for feature estimation showed an almost linear relationship with the number of scales. nScale = 3 gave good performance in both scenarios.
The effect of nOrient on the segmentation output is shown in Fig. 22. In both scenarios, the evaluation measures converged when 8 orientations were used. The feature extraction time varied linearly with the number of orientations because of the corresponding increase in filtering operations required at each scale.
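As a concrete illustration, a minimal sketch of a Gabor feature bank parameterized by nScale and nOrient follows; the default values mirror the optima found above. The dyadic scale spacing, the specific sigma/lambda/kernel-size choices, and the use of response magnitudes are assumptions of the sketch; only the two parameters themselves come from the text.

```python
import cv2
import numpy as np

def gabor_feature_stack(gray, n_scale=3, n_orient=8):
    """Stack Gabor magnitude responses over n_scale scales and n_orient
    orientations into an (H, W, n_scale * n_orient) feature array.
    'gray' is a single-channel float32 image."""
    feats = []
    for s in range(n_scale):
        sigma = 2.0 * (2 ** s)          # assumed dyadic scale spacing
        lambd = 2.0 * sigma
        ksize = int(6 * sigma) | 1      # odd kernel size covering ~3 sigma
        for o in range(n_orient):
            theta = o * np.pi / n_orient
            k = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma=0.5)
            feats.append(np.abs(cv2.filter2D(gray, cv2.CV_32F, k)))
    return np.stack(feats, axis=-1)
```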
B.2 Laws texture feature
The number of scales (nScale) and the filter width (fWidth) are the two parameters of the Laws texture approach. The influence of nScale on the segmentation output for fWidth = 3 and fWidth = 5 is shown in Figs. 23 and 24, respectively. The effect of the number of scales on segmentation performance was very similar to that of the Gabor texture method, with optimum results achieved for nScale = 3. The segmentation outputs obtained with 5×5 filters were slightly better than those obtained with 3×3 filters.
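Laws' masks themselves are standard: 2D masks are formed as outer products of small 1D kernels (level, edge, spot, ripple), and the texture energy is a local average of the absolute filter response. The sketch below shows this for fWidth = 5; the energy window size is an assumption, and for nScale > 1 the filtering would plausibly be repeated on progressively downscaled copies of the image.

```python
import numpy as np
from scipy.ndimage import convolve, uniform_filter

# Laws' 1D kernels for fWidth = 5 (fWidth = 3 uses L3 = [1, 2, 1],
# E3 = [-1, 0, 1], S3 = [-1, 2, -1])
L5 = np.array([1, 4, 6, 4, 1], dtype=np.float32)   # level (averaging)
E5 = np.array([-1, -2, 0, 2, 1], dtype=np.float32)  # edge
S5 = np.array([-1, 0, 2, 0, -1], dtype=np.float32)  # spot
R5 = np.array([1, -4, 6, -4, 1], dtype=np.float32)  # ripple

def laws_energy_maps(gray, kernels=(L5, E5, S5, R5), energy_win=15):
    """2D Laws masks are outer products of 1D kernels; texture energy
    is the local mean of the absolute filter response."""
    maps = []
    for kv in kernels:
        for kh in kernels:
            mask = np.outer(kv, kh)
            resp = convolve(gray.astype(np.float32), mask, mode="reflect")
            maps.append(uniform_filter(np.abs(resp), size=energy_win))
    return np.stack(maps, axis=-1)
```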
B.3 Neighborhood difference feature
The effects of the three parameters neiCount, neiSpace, and neiOrient on segmentation performance with the HSV color space are shown in Figs. 25, 26, and 27, respectively. Increasing the number of neighbors improved all three evaluation measures for both scenarios at the expense of the feature estimation time, which increased linearly. In the case of neiSpace, all evaluation measures converged to a maximum at a value of 5, while the time was unaffected by this parameter. Lastly, all evaluation measures converged for a neiOrient value of 8, and the feature estimation time showed an almost linear relationship with the number of neighbor orientations.
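The appendix does not define the neighbor layout precisely, but the reported timing behaviour (linear in neiCount and neiOrient, independent of neiSpace) is consistent with placing neiCount neighbors per direction along neiOrient directions, spaced neiSpace pixels apart. The sketch below implements that assumed layout; it is a plausible reading, not the paper's exact definition.

```python
import numpy as np

def neighborhood_difference(channel, nei_count=4, nei_space=5, nei_orient=8):
    """For each pixel, the difference between its value and the values
    at neighbors placed along nei_orient directions, nei_count neighbors
    per direction, spaced nei_space pixels apart (assumed layout)."""
    channel = channel.astype(np.float32)  # avoid uint8 wrap-around
    h, w = channel.shape
    ys, xs = np.mgrid[0:h, 0:w]
    feats = []
    for o in range(nei_orient):
        theta = 2 * np.pi * o / nei_orient
        for i in range(1, nei_count + 1):
            dy = int(round(i * nei_space * np.sin(theta)))
            dx = int(round(i * nei_space * np.cos(theta)))
            ny = np.clip(ys + dy, 0, h - 1)   # clamp at image borders
            nx = np.clip(xs + dx, 0, w - 1)
            feats.append(channel - channel[ny, nx])
    return np.stack(feats, axis=-1)
```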
B.4 Neighborhood histogram feature
The effects of the nBins and hWidth parameters of the neighborhood histogram feature on segmentation performance using the H channel are shown in Figs. 28 and 29, respectively. In the case of nBins, all evaluation measures converged to a maximum for both scenarios, while the time requirement varied linearly with the number of bins. The hWidth parameter, on the other hand, showed a decrease in recall and f-score beyond a limit value. Due to the custom implementation used, the time was unaffected by this parameter.
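The custom implementation is not reproduced here, but an integral histogram has exactly the property noted above: its runtime is independent of the window width hWidth. A minimal sketch, assuming a channel normalized to [0, 1] and uniform binning:

```python
import numpy as np

def neighborhood_histogram(channel, n_bins=16, h_width=15):
    """Normalized local histogram per pixel via an integral histogram,
    so runtime does not depend on the window width h_width."""
    h, w = channel.shape
    bins = np.minimum((channel * n_bins).astype(int), n_bins - 1)
    onehot = np.zeros((h, w, n_bins), dtype=np.float32)
    np.put_along_axis(onehot, bins[..., None], 1.0, axis=2)
    # One cumulative sum per direction gives an integral image per bin
    integ = np.pad(onehot.cumsum(0).cumsum(1), ((1, 0), (1, 0), (0, 0)))
    r = h_width // 2
    y0 = np.clip(np.arange(h) - r, 0, h)
    y1 = np.clip(np.arange(h) + r + 1, 0, h)
    x0 = np.clip(np.arange(w) - r, 0, w)
    x1 = np.clip(np.arange(w) + r + 1, 0, w)
    # Box sum from four integral-image lookups, independent of h_width
    hist = (integ[y1][:, x1] - integ[y0][:, x1]
            - integ[y1][:, x0] + integ[y0][:, x0])
    return hist / hist.sum(axis=2, keepdims=True)   # normalize per pixel
```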
B.5 Neighborhood probability feature
The effects of the different parameters of the neighborhood probability feature on segmentation performance are given in Figs. 30, 31 and 32. The behaviour was found to be similar to that of the neighborhood difference feature.
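No further detail is given here, so the sketch below simply assumes the feature samples a per-pixel skin-probability map (e.g., from a color-histogram skin model) at the same assumed neighbor layout used for the difference feature above; both the probability source and the layout are assumptions inferred from the stated similarity.

```python
import numpy as np

def neighborhood_probability(prob_map, nei_count=4, nei_space=5, nei_orient=8):
    """Sample a per-pixel skin-probability map at the same assumed
    neighbor layout as the neighborhood difference feature."""
    h, w = prob_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    feats = []
    for o in range(nei_orient):
        theta = 2 * np.pi * o / nei_orient
        for i in range(1, nei_count + 1):
            dy = int(round(i * nei_space * np.sin(theta)))
            dx = int(round(i * nei_space * np.cos(theta)))
            feats.append(prob_map[np.clip(ys + dy, 0, h - 1),
                                  np.clip(xs + dx, 0, w - 1)])
    return np.stack(feats, axis=-1)
```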
Appendix C: Evaluation of other feature combinations
Figure 33 shows the evaluation results for the feature combinations not listed in Section 4.4. The patterns visible in these results are similar to those identified earlier.
Cite this article
Martin, M., Nguyen, T., Yousefi, S. et al. Comprehensive features with randomized decision forests for hand segmentation from color images in uncontrolled indoor scenarios. Multimed Tools Appl 78, 20987–21020 (2019). https://doi.org/10.1007/s11042-019-7445-3