Abstract
In this article, the problem of human interaction recognition is addressed. A novel methodology is presented to model the interaction between two people using a Kinect sensor. We propose to analyse the distances between a subset of skeleton joints to determine their contribution to the recognition of the interaction. The most representative joints are then used to form a pentagon for each person, and five Euclidean distances are extracted between the corresponding vertices of the two pentagons. Finally, an SVM is used for the recognition of human interaction. Experimental results demonstrate the effectiveness of this work compared to recently published proposals.
Keywords
- Human interaction recognition
- Kinect
- Subset
- Skeleton joints
- Representative joints
- Pentagon
- Euclidean distances
- SVM
1 Introduction
Human-human interaction recognition has attracted increasing attention in recent years due to its wide range of applications in computer vision. However, recognizing human interactions remains a very difficult task, especially in realistic environments, because of large intra-class variations, lighting changes, clutter, and partially or totally occluded targets. Many studies have addressed this issue in recent decades [1, 2]. Recent work [3,4,5] has suggested that human interaction recognition based on both color images and depth maps can provide better accuracy. In addition, it is known that a sequence of human joints is an effective representation of structured movement [6]. Therefore, in this paper we present a new approach for human-human interaction recognition using depth cameras.
This paper proposes a framework for robust human interaction recognition and demonstrates its performance on a public benchmark dataset. The two-person interactions considered are parallel to the Kinect sensor, and each interaction is captured from three different angles (0°, 45° and 135°). Eight two-person interactions are taken into account, namely Approaching, Departing, Kicking, Pushing, Shaking Hands, Hugging, Exchanging and Punching. A pentagon is created for each person in each frame of a sequence. The pentagon vertices generated by the algorithm carry three-dimensional information, because the Kinect sensor models the human body using 20 body joint coordinates in three dimensions. For each frame, two pentagons are formed, one per interacting person. We calculate the Euclidean distances between corresponding vertices of the two persons' pentagons as features for our algorithm. For recognition, a multiclass support vector machine (SVM) [7, 8] is applied. Experimentally, we find that the recognition rate of our method, based on distances between a subset of the two persons' skeleton joints, outperforms other state-of-the-art methods.
Our contributions can be summarized as follows:
-
We propose a novel framework for human interaction recognition using depth information, including an algorithm that represents a depth sequence with as few key frames as possible.
-
We extend the low-level features (joint-joint distances) previously used in [9], which were computed over the entire human skeleton, and propose a new person-to-person interaction feature that describes the distances between a subset of joints extracted from the skeletons of the two persons.
The paper is organized as follows. Related work is reviewed in Sect. 2. Section 3 provides a detailed description of our approach and defines the proposed two-person interaction descriptor. Section 4 presents the experimental results and Sect. 5 concludes the paper.
2 Related Work
Over the last two decades, various approaches have been suggested that focus on different representations of the interaction between two or more persons. In [10], Park et al. propose an algorithm for human activity recognition with distributed camera sensors: in track-level analysis, gross-level activity patterns are analysed, while in body-level analysis a person's activity is analysed in terms of the coordination of individual body parts. In [11], a method is proposed to extract the spatial semantics between humans, including front, back, face to face, back to back, and left or right; a context-free grammar is then used to recognize the interactions. Rehg et al. [12] studied the interactions between children and an adult, combining the extracted information to acquire social representations of the interaction. Kong et al. [13] propose a method based on primitive interactive phrases for recognizing complex human interactions, where each relationship is described by a set of features common to different relationships. Raptis and Sigal [14] extract the most discriminative key frames and consider the local temporal context between them. Blunsden and Fisher [15] classify the movement of a group of people using a hidden Markov model, with body orientation also taken into account as an important aspect of the interaction between two persons.
Recently, many studies have applied skeleton-based classification to human interaction recognition. One advantage of depth cameras is the ability to capture human skeleton information in real time and provide the 3D joint coordinates of a human model. Yun et al. [9] present a method based on relational body-pose features which characterize geometric relations between specific joints in a single pose or a short sequence of poses, such as joint motion, joint distance, velocity and plane. In the classification phase, they apply linear SVMs and multiple instance learning (MIL) with a bag of body poses. In [16], a feature descriptor combining spatial relationships and semantic motion trend similarity is proposed for human-to-human interaction recognition: the motion trend of each skeleton joint is first quantified into a specific semantic word, and a kernel is then built to measure the similarity of intra- and inter-body parts by histogram intersection. Chen et al. [17] proposed a two-level hierarchical method based on skeleton information: in the first layer, the most representative joints are determined using a part-based clustering feature vector; in the second layer, only relevant joints within specific clusters are used for feature extraction.
This paper presents a new method to describe person-to-person interactions using skeleton-based features. We present a system that implements a simple technique to extract pertinent interaction features for two interacting persons. Unlike other works that extract features from the whole human skeleton [9], the developed system models an interaction using a few basic informative postures based on a few selected joints. Our system is tested on the SBU Kinect Interaction dataset [9] and shows good performance.
3 Proposed Approach
Our system aims at recognizing human-human interactions based on the distances between two-person skeleton joints extracted from depth images. This skeleton contains the 3-D position of a certain number of joints representing different parts of the human body and provides strong cues to recognize human interaction.
An overview of the proposed methodology is presented in Fig. 1. First, key frames are collected from the video sequence. As skeleton information, we compute distances between a subset of skeleton joints captured by the Microsoft Kinect sensor. For each frame, five Euclidean distances are extracted between the vertices of the two pentagons corresponding to the two persons. These features are then classified with a Support Vector Machine to recognize the human interaction.
3.1 Key Frames Extraction
This section describes the key-frame extraction algorithm. It proceeds in two phases: the first phase computes a threshold from the mean and standard deviation of the absolute histogram differences of consecutive frames; the second phase extracts key frames by comparing each absolute histogram difference against this threshold. The algorithm starts by extracting the frames from the video. Next, the histogram difference between each pair of consecutive frames is calculated, and the threshold T is computed as

T = Mean + a × StandardDeviation,

where Mean and StandardDeviation are the mean and standard deviation of the absolute histogram differences, and a is a constant. The second phase determines the key frames by comparing each absolute histogram difference against the threshold: if the difference of a pair exceeds T, the second frame of that pair is selected as a key frame; otherwise the pair is discarded. The proposed algorithm is given below:
Implementation Steps
Here, the input video V is read frame by frame, from the 1st to the Nth frame.
Step 1. Read an input video with N frames with n × n size.
Step 2. Read consecutive ith and (i + 1)th frames.
Step 3. Compute the histogram difference between the two consecutive frames.
Step 4. Calculate mean and standard deviation of absolute difference.
Step 5. Compute the threshold T.
Step 6. If difference > T, select the (i + 1)th frame as a key frame;
else discard it.
Step 7. Repeat this for all frames until the end of the video. Key frames are collected.
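The steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the histogram bin count and the default value of the constant a are assumptions, and frames are assumed to be grayscale NumPy arrays with pixel values in [0, 255].

```python
import numpy as np

def extract_key_frames(frames, a=1.0, bins=256):
    """Select key frames where the histogram difference of a
    consecutive pair exceeds T = Mean + a * StandardDeviation."""
    # Step 1-3: per-frame grayscale histograms and their pairwise
    # absolute differences for consecutive frames.
    hists = [np.histogram(f, bins=bins, range=(0, 256))[0] for f in frames]
    diffs = np.array([np.abs(h2 - h1).sum()
                      for h1, h2 in zip(hists, hists[1:])])
    # Step 4-5: threshold from the mean and standard deviation
    # of the absolute histogram differences.
    T = diffs.mean() + a * diffs.std()
    # Step 6-7: keep the (i+1)th frame of every pair whose
    # difference exceeds the threshold.
    return [i + 1 for i, d in enumerate(diffs) if d > T]
```

For a sequence with one abrupt content change, only the frame after the change is returned as a key frame.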
3.2 Human Interaction Representation
The human interaction feature extraction step is the core of the system. The input of this module is the temporally ordered set of key frames extracted previously, representing the most important postures of the original skeleton sequence. One advantage of adopting skeleton features is that the system is not affected by environmental light variations.
Our system is motivated by the observation that single-person skeleton features provide no information about the surrounding humans; this information can be exploited by taking the relationships between skeletons into account to represent a two-person interaction. The system improves on the work presented in [9]: our implementation uses distances between a subset of joints instead of distances between all skeleton joints. Each person is described using a pentagon, whose five vertices are formed as follows. The 1st vertex is obtained by averaging the head (H) and shoulder centre (SC) joints. The 2nd vertex is the average of shoulder right (SR), elbow right (ER), wrist right (WR) and hand right (HR). Similarly, the mean of shoulder left (SL), elbow left (EL), wrist left (WL) and hand left (HL) generates vertex 3. Vertices 4 and 5 are the averages of the right and left leg coordinates: hip right, knee right (KR), ankle right (AR) and foot right (FR) for vertex 4, and hip left, knee left (KL), ankle left (AL) and foot left (FL) for vertex 5.
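Under the joint grouping just described, the pentagon construction can be sketched as follows. The joint names below are illustrative assumptions; the Kinect SDK uses its own identifiers for the 20-joint model.

```python
import numpy as np

# Joint groups averaged into the five pentagon vertices
# (hypothetical names for the Kinect 20-joint model).
VERTEX_GROUPS = [
    ["head", "shoulder_center"],
    ["shoulder_right", "elbow_right", "wrist_right", "hand_right"],
    ["shoulder_left", "elbow_left", "wrist_left", "hand_left"],
    ["hip_right", "knee_right", "ankle_right", "foot_right"],
    ["hip_left", "knee_left", "ankle_left", "foot_left"],
]

def pentagon(skeleton):
    """Map a {joint_name: (x, y, z)} skeleton to a (5, 3) array of
    pentagon vertices, each vertex being the mean of its joint group."""
    return np.array([np.mean([skeleton[j] for j in group], axis=0)
                     for group in VERTEX_GROUPS])
```

One pentagon is built per person per key frame, so a frame of a two-person interaction yields two (5, 3) vertex arrays.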
In Fig. 2, the computed pentagon vertices are shown as black stars, and the red dashed lines illustrate the edges of the pentagons. For each person, a pentagon is formed at the ith frame. Let the right and left persons be represented by R and L, respectively. The Euclidean distance between the ith vertices of the two pentagons is then d_i = ||R_i − L_i||, for i = 1, …, 5.
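The resulting per-frame feature vector can be computed as below. This sketch assumes the two pentagons are already available as (5, 3) vertex arrays, one per person.

```python
import numpy as np

def pentagon_distances(R, L):
    """Given (5, 3) arrays of pentagon vertices for the right (R) and
    left (L) person, return the five distances d_i = ||R_i - L_i||."""
    R, L = np.asarray(R, dtype=float), np.asarray(L, dtype=float)
    # Row-wise Euclidean norm: one distance per matching vertex pair.
    return np.linalg.norm(R - L, axis=1)
```

Stacking these five-dimensional vectors over the key frames of a sequence yields the interaction descriptor passed to the classifier.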
3.3 Human Interaction Classification Using SVM
In this last step, multiclass SVMs are trained for supervised classification, as in [18], using the human interaction features obtained in the previous step (see Sect. 3.2). We employ the multiclass SVM classifier, which achieved better performance than other classifier types.
4 Experimental Results
4.1 Recognition Results
In this section, we evaluate the proposed method and compare its performance to other methods. We tested our system on a publicly available dataset, the SBU Kinect Interaction dataset [9], which contains eight types of human-human interactions: approaching, departing, kicking, pushing, shaking hands, hugging, exchanging and punching. The system achieves a recognition rate of 95% with the multiclass SVM classifier. When the angle of interaction with respect to the Kinect sensor varies, i.e. when the angle is 45° or 135°, the average performance degrades to 81.3%. In Fig. 3, the pentagon formed for each person is drawn with green dotted lines, and the red lines illustrate the Euclidean distances between the vertices of the two pentagons.
Table 1 presents the experimental values obtained for five Euclidean distances compared to other state of the art methods (Table 2 and Fig. 4).
4.2 Results Discussion
The experimental results are very encouraging and show the potential of using distances between a subset of skeleton joints as features for two-person interaction recognition, compared to using distances between all skeleton joints. In [9], two-person interactions are modeled using multiple instance learning and linear SVMs, achieving a maximum recognition rate of 80%; however, that work does not take rotation variance into account. The proposed method not only produces higher recognition rates than [9], but also handles rotation variation and does so with lower time complexity.
5 Conclusions and Future Work
In this paper, we have presented a new framework for describing two-person interactions based on skeleton characteristics. The distances between a subset of human skeleton joints are calculated to distinguish between these interactions; this subset consists of the skeleton joints contributing the most to an activity. Key frames are extracted from the video, a pentagon is constructed for each person, and five Euclidean distances are calculated between the corresponding vertices of the two pentagons. Feature vectors are composed using the 3D positions of these joints as well as the distances between them, and are used by an SVM for classification. We reach an accuracy of 95% when using a subset of joints; the recognition accuracy when tracking all skeleton joints is also evaluated for comparison and is found to be lower. Future work may address the case where an activity on which the algorithm has not been trained is given as input to the classifier: the SVM, like any other supervised learning algorithm, requires training on all possible outcomes, so an unknown activity will be labelled as the class it resembles most from the training set.
References
Ryoo, M.S., Aggarwal, J.K.: Semantic representation and recognition of continued and recursive human activities. Int. J. Comput. Vis. 28(1), 1–24 (2009)
Wang, H., Yuan, C., Hu, W., Ling, H., Yang, W., Sun, C.: Action recognition using nonnegative action component representation and sparse basis selection. IEEE Trans. Image Process. 23(2), 570–581 (2014)
Ni, B., Wang, G., Moulin, P.: RGBD-HuDaAct: a color-depth video database for human daily activity recognition. In: ICCV Workshops (2011)
Li, W., Zhang, Z., Liu, Z.: Action recognition based on a bag of 3D points. In: CVPR Workshops (2010)
Alcoverro, M., Lopez-Mendez, A., Pardas, M., Casas, J.: Connected operators on 3D data for human body analysis. In: CVPR Workshops (2011)
Gu, J., Ding, X., Wang, S., Wu, Y.: Action and gait recognition from recovered 3-D human joints. IEEE Trans. Syst. Man Cybern. 40, 1021–1033 (2010)
Raptis, M., Sigal, L.: Poselet key-framing: a model for human activity recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2650–2657 (2013)
Agarwal, A., Triggs, B.: Recovering 3D human pose from monocular images. IEEE Trans. Pattern Anal. Mach. Intell. 28(1), 44–58 (2006)
Yun, K., Honorio, J., Chattopadhyay, D., et al.: Two-person interaction detection using body-pose features and multiple instance learning. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 28–35 (2012)
Park, S., Trivedi, M.: Multi-person interaction and activity analysis: a synergistic track- and body-level analysis framework. Mach. Vis. Appl. 18, 151–166 (2007)
Jin, B., Hu, W., Wang, H.: Human interaction recognition based on transformation of spatial semantics. IEEE Signal Process. Lett. 19(3), 139–142 (2012)
Rehg, J., Abowd, G., Rozga, A., et al.: Decoding children’s social behavior. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3414–3421 (2013)
Kong, Yu., Jia, Y., Fu, Y.: Learning human interaction by interactive phrases. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7572, pp. 300–313. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33718-5_22
Raptis, M., Sigal, L.: Poselet key-framing: a model for human activity recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2013, pp. 2650–2657 (2013)
Blunsden, S., Fisher, R.B.: The behave video dataset: ground truthed video for multi-person behavior classification. Ann. BMVA 4(1–12), 4 (2010)
Liu, B., Cai, H., Ji, X., Liu, H.: Human-human interaction recognition based on spatial and motion trend feature. In: 2017 IEEE International Conference on Image Processing (ICIP), pp. 4547–4551 (2017)
Chen, H., Wang, G., Xue, J.H., He, L.: A novel hierarchical framework for human action recognition. Pattern Recogn. 55, 148–159 (2016)
Ammar, S., Zaghden, N., Neji, M.: A framework for people re-identification in multi-camera surveillance system. In: 14th International Conference on Cognition and Exploratory Learning in Digital Age (CELDA 2017), pp. 319–322 (2017)
Ji, Y., Cheng, H., Zheng, Y., Li, H.: Learning contrastive feature distribution model for interaction recognition. J. Vis. Commun. Image Represent. 33, 340–349 (2015)
Ji, Y., Ye, G., Cheng, H.: Interactive body part contrast mining for human interaction recognition. In: Multimedia and Expo Workshops (ICMEW). IEEE (2014)
© 2019 Springer Nature Switzerland AG
Ammar, S., Zaghden, N., Neji, M. (2019). An Effective Approach Based on a Subset of Skeleton Joints for Two-Person Interaction Recognition. In: Vera-Rodriguez, R., Fierrez, J., Morales, A. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2018. Lecture Notes in Computer Science(), vol 11401. Springer, Cham. https://doi.org/10.1007/978-3-030-13469-3_106
Print ISBN: 978-3-030-13468-6
Online ISBN: 978-3-030-13469-3