Abstract
Hand segmentation is an integral part of many computer vision applications, especially gesture recognition. One of the most popular approaches is to train a classifier that labels pixels as hand or background using skin color as a feature. This approach has largely been restricted to simple hand segmentation scenarios, since the color feature alone provides very limited information for classification. Meanwhile, there has been a rise of segmentation methods that use deep neural networks to exploit multiple layers of complex features learned from image data. Yet a deep neural network requires a large database for training and, owing to its computational complexity, a powerful machine for operation. In this work, the development of comprehensive features, and their optimized use with a randomized decision forest (RDF) classifier, is investigated for the task of hand segmentation in uncontrolled indoor environments. Newly designed image features and new implementations are presented together with evaluations of their hand segmentation performance. In total, seven image features that extract pixel- or neighborhood-related properties from color images are proposed and evaluated, both individually and in combination. The behaviours of the feature and RDF parameters are also evaluated, and optimum parameters for the scenario under consideration are identified. Additionally, a new dataset containing hand images in uncontrolled indoor scenarios was created during this work. The research shows that a combination of features extracting color, texture, neighborhood histogram, and neighborhood probability information outperforms existing methods for hand segmentation in restricted as well as unrestricted indoor environments, using only a small training dataset. The computations required for the proposed features and the RDF classifier are light, so the segmentation algorithm is suited to embedded devices with limited power, memory, and computational capacity.
Appendices
Appendix A: Features
A.1 Comparison of the new implementation for estimating texture using Gabor filters
Figure 20 compares the filter responses of the new implementation and the original method for one of the selected scales and orientations. The error in this case was 0.02% in the mean-square sense.
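The appendix does not reproduce the implementation itself, so the following is a minimal Python/OpenCV sketch of the idea suggested in B.1 by the "scale down operation": instead of filtering the full-resolution image with a large-scale Gabor kernel, the image is downscaled and filtered with a proportionally smaller kernel, and the response is resized back. All kernel parameters, the file name, and the relative-MSE metric are illustrative assumptions, not the paper's exact values.

```python
import cv2
import numpy as np

def gabor_response_direct(img, ksize, sigma, theta, lambd):
    """Reference: filter at full resolution with a large kernel."""
    kernel = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma=0.5)
    return cv2.filter2D(img, cv2.CV_32F, kernel)

def gabor_response_pyramid(img, ksize, sigma, theta, lambd, level):
    """Approximation: downscale the image 'level' times, filter with a
    small base kernel, then resize the response to full resolution."""
    small = img.copy()
    for _ in range(level):
        small = cv2.pyrDown(small)
    kernel = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma=0.5)
    resp = cv2.filter2D(small, cv2.CV_32F, kernel)
    return cv2.resize(resp, (img.shape[1], img.shape[0]))

# Hypothetical input image; parameters scale by 2**level between the two calls
img = cv2.imread("hand.png", cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0
ref = gabor_response_direct(img, ksize=31, sigma=8.0, theta=0.0, lambd=16.0)
fast = gabor_response_pyramid(img, ksize=9, sigma=2.0, theta=0.0, lambd=4.0, level=2)

# Relative mean-square error between the two responses
mse = np.mean((ref - fast) ** 2) / np.mean(ref ** 2)
print(f"relative MSE: {100 * mse:.3f}%")
```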
Appendix B: Optimization of feature parameters
B.1 Gabor texture feature
As explained in Section 3.2, there are two parameters associated with the Gabor texture feature: the number of scales (nScale) and the number of orientations (nOrient). Figure 21 shows how nScale affects segmentation performance and feature extraction time. The maximum possible value of nScale is 5, limited by the image dimensions and the scale-down operation used for the faster implementation. It can be observed from the figure that precision increased with the number of scales for both scenarios, whereas the behaviour of recall differed between the databases. The time required for feature estimation showed an almost linear relationship with the number of scales. nScale = 3 gave good performance in both scenarios.
The effect of nOrient on the segmentation output is shown in Fig. 22. In both scenarios, the evaluation measures converged when 8 orientations were used. The feature extraction time varied linearly with the number of orientations because of the corresponding increase in filtering operations required at each scale.
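As a concrete illustration, a minimal sketch of a Gabor feature bank parameterized by nScale and nOrient follows; the default values mirror the optima found above. The dyadic scale spacing, the specific sigma/lambda/kernel-size choices, and the use of response magnitudes are assumptions of the sketch; only the two parameters themselves come from the text.

```python
import cv2
import numpy as np

def gabor_feature_stack(gray, n_scale=3, n_orient=8):
    """Stack Gabor magnitude responses over n_scale scales and n_orient
    orientations into an (H, W, n_scale * n_orient) feature array.
    'gray' is a single-channel float32 image."""
    feats = []
    for s in range(n_scale):
        sigma = 2.0 * (2 ** s)          # assumed dyadic scale spacing
        lambd = 2.0 * sigma
        ksize = int(6 * sigma) | 1      # odd kernel size covering ~3 sigma
        for o in range(n_orient):
            theta = o * np.pi / n_orient
            k = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma=0.5)
            feats.append(np.abs(cv2.filter2D(gray, cv2.CV_32F, k)))
    return np.stack(feats, axis=-1)
```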
B.2 Laws texture feature
The number of scales (nScale) and the filter width (fWidth) are the two parameters of the Laws texture approach. The influence of nScale on the segmentation output for fWidth = 3 and fWidth = 5 is shown in Figs. 23 and 24, respectively. The effect of the number of scales on segmentation performance was very similar to that of the Gabor texture method, with optimum results achieved for nScale = 3. The segmentation outputs obtained with 5×5 filters were slightly better than those obtained with 3×3 filters.
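Laws' masks themselves are standard: 2D masks are formed as outer products of small 1D kernels (level, edge, spot, ripple), and the texture energy is a local average of the absolute filter response. The sketch below shows this for fWidth = 5; the energy window size is an assumption, and for nScale > 1 the filtering would plausibly be repeated on progressively downscaled copies of the image.

```python
import numpy as np
from scipy.ndimage import convolve, uniform_filter

# Laws' 1D kernels for fWidth = 5 (fWidth = 3 uses L3 = [1, 2, 1],
# E3 = [-1, 0, 1], S3 = [-1, 2, -1])
L5 = np.array([1, 4, 6, 4, 1], dtype=np.float32)   # level (averaging)
E5 = np.array([-1, -2, 0, 2, 1], dtype=np.float32)  # edge
S5 = np.array([-1, 0, 2, 0, -1], dtype=np.float32)  # spot
R5 = np.array([1, -4, 6, -4, 1], dtype=np.float32)  # ripple

def laws_energy_maps(gray, kernels=(L5, E5, S5, R5), energy_win=15):
    """2D Laws masks are outer products of 1D kernels; texture energy
    is the local mean of the absolute filter response."""
    maps = []
    for kv in kernels:
        for kh in kernels:
            mask = np.outer(kv, kh)
            resp = convolve(gray.astype(np.float32), mask, mode="reflect")
            maps.append(uniform_filter(np.abs(resp), size=energy_win))
    return np.stack(maps, axis=-1)
```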
B.3 Neighborhood difference feature
The effects of the three parameters neiCount, neiSpace, and neiOrient on segmentation performance with the HSV color space are shown in Figs. 25, 26, and 27, respectively. Increasing the number of neighbors improved all three evaluation measures for both scenarios at the expense of the feature estimation time, which increased linearly. In the case of neiSpace, all evaluation measures converged to a maximum at a value of 5, while the time was unaffected by this parameter. Lastly, all evaluation measures converged for a neiOrient value of 8, and the feature estimation time showed an almost linear relationship with the number of neighbor orientations.
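The appendix does not define the neighbor layout precisely, but the reported timing behaviour (linear in neiCount and neiOrient, independent of neiSpace) is consistent with placing neiCount neighbors per direction along neiOrient directions, spaced neiSpace pixels apart. The sketch below implements that assumed layout; it is a plausible reading, not the paper's exact definition.

```python
import numpy as np

def neighborhood_difference(channel, nei_count=4, nei_space=5, nei_orient=8):
    """For each pixel, the difference between its value and the values
    at neighbors placed along nei_orient directions, nei_count neighbors
    per direction, spaced nei_space pixels apart (assumed layout)."""
    channel = channel.astype(np.float32)  # avoid uint8 wrap-around
    h, w = channel.shape
    ys, xs = np.mgrid[0:h, 0:w]
    feats = []
    for o in range(nei_orient):
        theta = 2 * np.pi * o / nei_orient
        for i in range(1, nei_count + 1):
            dy = int(round(i * nei_space * np.sin(theta)))
            dx = int(round(i * nei_space * np.cos(theta)))
            ny = np.clip(ys + dy, 0, h - 1)   # clamp at image borders
            nx = np.clip(xs + dx, 0, w - 1)
            feats.append(channel - channel[ny, nx])
    return np.stack(feats, axis=-1)
```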
B.4 Neighborhood histogram feature
The effects of the nBins and hWidth parameters of the neighborhood histogram feature on segmentation performance using the H channel are shown in Figs. 28 and 29, respectively. In the case of nBins, all evaluation measures converged to a maximum for both scenarios, while the time requirement varied linearly with the number of bins. The hWidth parameter, on the other hand, showed a decrease in recall and f-score beyond a limit value. Due to the custom implementation used, the time was unaffected by this parameter.
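The custom implementation is not reproduced here, but an integral histogram has exactly the property noted above: its runtime is independent of the window width hWidth. A minimal sketch, assuming a channel normalized to [0, 1] and uniform binning:

```python
import numpy as np

def neighborhood_histogram(channel, n_bins=16, h_width=15):
    """Normalized local histogram per pixel via an integral histogram,
    so runtime does not depend on the window width h_width."""
    h, w = channel.shape
    bins = np.minimum((channel * n_bins).astype(int), n_bins - 1)
    onehot = np.zeros((h, w, n_bins), dtype=np.float32)
    np.put_along_axis(onehot, bins[..., None], 1.0, axis=2)
    # One cumulative sum per direction gives an integral image per bin
    integ = np.pad(onehot.cumsum(0).cumsum(1), ((1, 0), (1, 0), (0, 0)))
    r = h_width // 2
    y0 = np.clip(np.arange(h) - r, 0, h)
    y1 = np.clip(np.arange(h) + r + 1, 0, h)
    x0 = np.clip(np.arange(w) - r, 0, w)
    x1 = np.clip(np.arange(w) + r + 1, 0, w)
    # Box sum from four integral-image lookups, independent of h_width
    hist = (integ[y1][:, x1] - integ[y0][:, x1]
            - integ[y1][:, x0] + integ[y0][:, x0])
    return hist / hist.sum(axis=2, keepdims=True)   # normalize per pixel
```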
B.5 Neighborhood probability feature
The effects of the different parameters of the neighborhood probability feature on segmentation performance are given in Figs. 30, 31 and 32. The behaviour was found to be similar to that of the neighborhood difference feature.
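No further detail is given here, so the sketch below simply assumes the feature samples a per-pixel skin-probability map (e.g., from a color-histogram skin model) at the same assumed neighbor layout used for the difference feature above; both the probability source and the layout are assumptions inferred from the stated similarity.

```python
import numpy as np

def neighborhood_probability(prob_map, nei_count=4, nei_space=5, nei_orient=8):
    """Sample a per-pixel skin-probability map at the same assumed
    neighbor layout as the neighborhood difference feature."""
    h, w = prob_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    feats = []
    for o in range(nei_orient):
        theta = 2 * np.pi * o / nei_orient
        for i in range(1, nei_count + 1):
            dy = int(round(i * nei_space * np.sin(theta)))
            dx = int(round(i * nei_space * np.cos(theta)))
            feats.append(prob_map[np.clip(ys + dy, 0, h - 1),
                                  np.clip(xs + dx, 0, w - 1)])
    return np.stack(feats, axis=-1)
```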
Appendix C: Evaluation of other feature combinations
Figure 33 shows the evaluation results for the feature combinations not listed in Section 4.4. The patterns visible in these results are similar to those identified earlier.
Cite this article
Martin, M., Nguyen, T., Yousefi, S. et al. Comprehensive features with randomized decision forests for hand segmentation from color images in uncontrolled indoor scenarios. Multimed Tools Appl 78, 20987–21020 (2019). https://doi.org/10.1007/s11042-019-7445-3