CN111695523A - Double-current convolutional neural network action identification method based on skeleton space-time and dynamic information - Google Patents
Info
- Publication number
- CN111695523A CN111695523A CN202010539760.5A CN202010539760A CN111695523A CN 111695523 A CN111695523 A CN 111695523A CN 202010539760 A CN202010539760 A CN 202010539760A CN 111695523 A CN111695523 A CN 111695523A
- Authority
- CN
- China
- Prior art keywords
- joint
- motion
- space
- skeleton
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/60—Rotation of whole images or parts thereof
- G06T3/604—Rotation of whole images or parts thereof using coordinate rotation digital computer [CORDIC] devices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/20—Image enhancement or restoration using local operators
- G06T5/30—Erosion or dilatation, e.g. thinning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/30—Noise filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Abstract
The invention discloses a dual-stream convolutional neural network action recognition method based on skeleton spatio-temporal and dynamic information. It belongs to the field of computer vision image and video processing and addresses the low recognition rate of skeleton-based action recognition methods in complex scenes. The key steps are: (1) inputting a skeleton sequence and performing a coordinate-system conversion on it; (2) constructing a skeleton spatio-temporal feature map and a joint motion velocity map from the converted coordinate information; (3) enhancing the features of the skeleton spatio-temporal feature map and the joint motion velocity map based on motion saliency and morphological operators, respectively; (4) classifying the action by deeply fusing the enhanced skeleton spatio-temporal feature map and joint motion velocity map with a dual-stream convolutional neural network. The method effectively improves action recognition accuracy in complex scenes with viewpoint changes, heavy noise, and subtly different actions.
Description
Technical Field
The invention belongs to the field of computer vision image and video processing, and relates to an action recognition method based on skeleton spatio-temporal and dynamic features combined with a two-stream convolutional neural network (Two-Stream CNN, TS-CNN).
Background
Human action recognition, a research hotspot in computer vision, has important application value in intelligent surveillance, human-computer interaction, video retrieval and other fields. It mainly faces the following technical difficulties. RGB-based methods are not robust to factors such as illumination changes and cluttered backgrounds. Depth images carry highly redundant information, which increases the computational complexity of the algorithms and limits their practical application. Because the raw skeleton information captured by depth sensors contains noise and the spatio-temporal information of the joints is ambiguous, effectively extracting motion information from three-dimensional skeleton data to recognize human actions remains a great challenge. Action recognition methods based on hand-crafted features extract only a single type of feature, so their recognition accuracy is limited and their generality is poor. Exploiting the good temporal modelling ability of RNNs, action recognition models have been built on RNNs, but an RNN cannot effectively express the spatial-domain relations between joints. Exploiting the powerful spatial-domain feature extraction ability of CNNs, action features have been extracted from images encoding the skeleton sequence; however, when traditional methods encode a skeleton sequence into a colour texture map, the following problems remain. First, each joint's information is encoded into the colour image independently, ignoring the related information between joints. Second, the spatial constraints between joints are ignored, so the joint spatial-domain information is disordered and the recognition accuracy is limited. Finally, only the static characteristics of the joints are considered; their dynamic characteristics and the different degrees to which individual joints participate in completing an action are ignored, so the encoding of the motion information is incomplete, the joint spatial-domain saliency information is lost, and the action recognition rate is limited.
Disclosure of Invention
To solve the above problems, the invention provides a dual-stream convolutional neural network action recognition method based on skeleton spatio-temporal and dynamic information, which addresses the low recognition rate of skeleton-based action recognition methods in complex scenes.
The invention adopts the following technical scheme: a dual-stream convolutional neural network action recognition method based on skeleton spatio-temporal and dynamic information, comprising the following steps:
(1) Input a skeleton sequence and perform a coordinate-system conversion on the obtained skeleton sequence.
(2) Construct a skeleton spatio-temporal feature map and a joint motion velocity map from the converted coordinate information, specifically:
(2.1) Encode the relative joint coordinates and the absolute joint coordinates into a skeleton spatio-temporal feature map under human-body structure constraints.
(2.2) Encode the joint velocity information at the same time step into a joint motion velocity map.
(3) Enhance the features of the skeleton spatio-temporal feature map and the joint motion velocity map based on motion saliency and a morphological operator, respectively.
(4) Classify the action by deeply fusing the enhanced skeleton spatio-temporal feature map and joint motion velocity map with the dual-stream convolutional neural network.
Further, step (1) is specifically as follows:
The skeleton sequences captured by the depth sensor lie in a Cartesian coordinate system with the camera as the origin; the three-dimensional skeleton coordinates are converted to a body coordinate system that effectively represents the spatial-domain information, as follows:
A body coordinate system is constructed with the hip joint, whose motion amplitude is small, as its origin. For a video sequence with N joint points and F frames, the joint coordinates are converted as

$\hat{p}_j^f = p_j^f - p_{hip}^f$

where $p_j^f$ and $\hat{p}_j^f$ are the coordinates of joint j in the f-th frame before and after the coordinate-system transformation, respectively, and $p_{hip}^f$ is the coordinate of the hip joint in the f-th frame.
Further, step (2) is specifically as follows:
In step (2.1), the absolute joint coordinates and the relative coordinates between joints are jointly encoded into a colour texture map to form the skeleton spatio-temporal feature map representing the spatio-temporal characteristics of the action, as follows:

$q_{j\_i}^f = \hat{p}_j^f - \hat{p}_i^f$

where $q_{j\_i}^f$ is the three-dimensional coordinate of the j-th joint relative to the i-th joint in the f-th frame and represents the spatial information of the bone connecting joints j and i; when i = 1, $q_{j\_1}^f$ is the absolute coordinate of the j-th joint.
The spatio-temporal feature of the j-th joint is then represented by the matrix $Q_{j\_i} = [q_{j\_i}^1, q_{j\_i}^2, \ldots, q_{j\_i}^F]$.
Only the first- and second-level related information with higher correlation is selected, given respectively by

$R_1=[Q_{h\_k},Q_{j\_i},\ldots,Q_{m\_n}], \quad R_2=[Q_{p\_o},Q_{u\_v},\ldots,Q_{y\_x}]$ (4)

where (h, k), (j, i) and (m, n) denote joint pairs connected by exactly one edge, and (p, o), (u, v) and (y, x) denote joint pairs connected by two edges.
The coordinate information is arranged according to the body structure: all joints of the body are divided into the following five groups: left arm, right arm, left leg, right leg and torso, each group arranged in the order of the physical connections between joints. The skeleton spatio-temporal feature map obtained with this coding order is

$E_k = [A, R_1, R_2]$

where k is the action class and A contains the absolute joint coordinates. Letting the three-dimensional coordinates correspond to the R, G and B channels respectively converts the skeleton spatio-temporal feature $E_k$ into a 72×F skeleton spatio-temporal feature map.
In step (2.2), the velocity information of each joint is extracted to represent the dynamic characteristics of the motion, and a feature descriptor representing the joint motion characteristics is constructed from the velocity scalar information. The velocity values of a joint along the x, y and z directions in frame f are

$v_x = \dfrac{x^{f+\Delta f}-x^{f}}{\Delta t}, \quad v_y = \dfrac{y^{f+\Delta f}-y^{f}}{\Delta t}, \quad v_z = \dfrac{z^{f+\Delta f}-z^{f}}{\Delta t}$

where $(x^{f+\Delta f}, y^{f+\Delta f}, z^{f+\Delta f})$ is the three-dimensional coordinate of the joint in frame f+Δf and Δf is the time step, with

$\Delta t = \Delta f / \mathrm{FPS}$

where FPS is the frame rate of the camera used.
Letting $v_x$, $v_y$ and $v_z$ correspond to the R, G and B channels respectively, the joint motion information is encoded as an N×(F−Δf) joint motion velocity map.
Further, step (3) is specifically as follows:
(3.1) The joint spatial-domain information with salient motion characteristics in the skeleton spatio-temporal feature map is enhanced based on motion energy, specifically:
During the k-th class action sequence, the instantaneous energy of joint i with coordinate $\hat{p}_i^f$ in the f-th frame is

$e_i^{f} = \lVert \hat{p}_i^{f} - \hat{p}_i^{f-1} \rVert$

where f is greater than 1 and ‖·‖ denotes the Euclidean distance. The motion energy of joint i over the whole action sequence is its accumulated instantaneous energy, min-max normalized as

$\varepsilon_i^{k} = \dfrac{\sum_{f=2}^{F} e_i^{f} - e_{\min}^{k}}{e_{\max}^{k} - e_{\min}^{k}}$

where $e_{\max}^{k}$ and $e_{\min}^{k}$ are the maximum and minimum values of the motion energy over all joints during the k-th class action sequence.
Following the coding order, the colour weights of all joints in the k-th class action, $\omega^{k} = [\varepsilon_1^{k}, \ldots, \varepsilon_N^{k}]$, are used as motion-enhancement weights, and the enhanced skeleton spatio-temporal feature map is obtained by scaling each joint's colour values in $E_k$ by its weight.
(3.2) The texture information of the joint motion velocity map is enhanced with a morphological operator to improve the velocity-estimation performance, as follows:
First, an erosion operation is applied to the joint motion velocity map to remove noise:

$X \ominus E = \{x \mid E_x \subseteq X\}$ (12)

where X is a binary image, ⊖ denotes the erosion operation and E is the structuring element. Applying formula (12) to the velocity values $v_x$, $v_y$ and $v_z$ of the joints along the x, y and z directions in frame f obtained in step (2.2) gives

$I_v = [\,v_x \ominus E \;\; v_y \ominus E \;\; v_z \ominus E\,]$ (13)

where $I_v$ denotes the joint motion velocity map after erosion.
A dilation operation is then applied to the eroded image:

$J_v = I_v \oplus E$

where $J_v$ denotes the joint motion velocity map after erosion and dilation, ⊖ denotes the erosion operation and ⊕ denotes the dilation operation.
Further, step (4) is specifically as follows:
The dual-stream convolutional neural network model is based on the AlexNet model, with the numbers of neurons (convolution kernels) in its first, third and fourth layers set to 64, 256 and 256, respectively. The skeleton spatio-temporal feature map and the joint motion velocity map are taken as the inputs of the static and dynamic streams, respectively; after processing by the convolutional, pooling and fully connected layers, the posterior probabilities produced by the single-stream CNNs are fused into the final recognition result.
Further, the skeleton spatio-temporal feature map and the joint motion velocity map are taken as the inputs of the static and dynamic streams, respectively, and after processing by the convolutional, pooling and fully connected layers, the posterior probabilities produced by the single-stream CNNs are fused into the final recognition result, as follows:
Given a skeleton sequence $S_m$, the skeleton spatio-temporal feature map and the joint motion velocity map are obtained by the above processing and scaled to 227×227 pixels by bilinear interpolation to facilitate the subsequent deep-feature extraction. The deep features extracted by the CNN are fed to the last fully connected layer and then normalized by a Softmax function to obtain the posterior probability

$P(n \mid x) = \dfrac{\exp(z_n)}{\sum_{i=1}^{N}\exp(z_i)}$

where $P(n \mid x)$ is the probability that the image x of the m-th skeleton sequence belongs to the n-th action class, $z_n$ is the input of the n-th neuron of the last fully connected layer, x denotes the skeleton spatio-temporal feature map or the joint motion velocity map, and N is the number of action classes.
For each class n, the dual-stream convolutional neural network model outputs $P_{SSTM}(n \mid x)$ and $P_{JMSM}(n \mid x)$, and multiplicative fusion is applied to the two stream outputs to obtain the final classification result:

ActionClass = Fin(Max(P_SSTM ⊙ P_JMSM)) (16)

where Fin(·) is the maximum-label function, Max(·) is the maximum operator, ⊙ is the Hadamard product operator, SSTM denotes the skeleton spatio-temporal feature map, JMSM denotes the joint motion velocity map, $P_{SSTM}$ is the softmax output value of the static stream and $P_{JMSM}$ is the softmax output value of the dynamic stream, each computed with the Softmax expression above for the corresponding stream.
has the advantages that: the invention is based on the action recognition of space-time and dynamic characteristics, and transforms the coordinate system of each type of action; constructing descriptors of skeleton space-time characteristics and motion characteristics; joint space domain information with obvious movement characteristics in the skeleton space-time characteristic diagram is enhanced, and a joint motion velocity diagram is enhanced by using a morphological operator to eliminate noise; and realizing action classification based on the enhanced bone space-time characteristic diagram and the joint motion velocity diagram of the deep fusion of the double-flow convolutional neural network. In the invention, because a relatively stable joint is selected as a coordinate origin to transform a skeleton sequence coordinate system, the obtained body coordinate system can effectively represent the related information between joints, and a skeleton space-time characteristic diagram is constructed by using the related information; body structure constraint is added when a skeleton sequence is coded, so that the recognition rate among different types of actions is greatly improved; in addition, after the dynamic skeleton information is added, the motion characteristic information is more comprehensively represented, so that the overall recognition rate of the invention is obviously improved; and finally, the difference between similar actions is reduced by enhancing the motion significance, and the error recognition rate between similar actions is reduced. Compared with the mainstream human body action identification method, the method has higher identification rate under the complex scenes of visual angle change, noise, main body diversity, similar action diversity and the like.
Drawings
FIG. 1 is a schematic flow chart of the main framework of the method of the invention.
FIG. 2 shows the skeleton coordinates in the Kinect coordinate system.
FIG. 3 is a visualization of the joints in the body coordinate system.
FIG. 4 shows the set of 25 human joints.
FIG. 5 compares the joint distance map and the proposed skeleton spatio-temporal feature map: (a1) the joint distance map; (a2) the skeleton spatio-temporal feature map.
FIG. 6 shows the image-enhanced colour texture maps: (b1) motion enhancement of the skeleton spatio-temporal feature map; (b2) visual enhancement of the joint motion velocity map.
FIG. 7 shows the dual-stream convolutional neural network model.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
In the invention, the flow of the dual-stream convolutional neural network action recognition method based on skeleton spatio-temporal and dynamic information is shown in FIG. 1, and the implementation steps are as follows:
(1) A coordinate-system conversion is applied to the skeleton sequence to obtain a body coordinate system with the hip joint as the coordinate origin.
The skeleton sequences captured by a depth sensor such as the Kinect lie in a Cartesian coordinate system with the camera as the origin, as shown in FIG. 2. To obtain a body coordinate system that effectively represents the spatial-domain information, the three-dimensional skeleton coordinates must be converted, specifically:
A body coordinate system is constructed with the hip joint, whose motion amplitude is small, as its origin. For a video sequence with N joint points and F frames, the joint coordinate transformation can be expressed as

$\hat{p}_j^f = p_j^f - p_{hip}^f$

where $p_j^f$ and $\hat{p}_j^f$ are the coordinates of joint j in the f-th frame before and after the coordinate-system transformation, respectively, and $p_{hip}^f$ is the coordinate of the hip joint in the f-th frame. A visualization of the joints after the transformation is shown in FIG. 3.
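A minimal sketch of this coordinate-system conversion is shown below, assuming the skeleton sequence is stored as an (F, N, 3) NumPy array and that the index of the hip joint is known; both the array layout and the hip index are assumptions of the sketch, not fixed by the invention.

```python
import numpy as np

def to_body_coordinates(skeleton, hip_index=0):
    """Convert a skeleton sequence from the camera (Kinect) coordinate system
    to a body coordinate system centred on the hip joint.

    skeleton: array of shape (F, N, 3) -- F frames, N joints, xyz coordinates.
    hip_index: index of the hip joint used as the new origin (assumed).
    """
    skeleton = np.asarray(skeleton, dtype=np.float64)
    hip = skeleton[:, hip_index:hip_index + 1, :]   # (F, 1, 3) hip coordinate per frame
    return skeleton - hip                           # subtract the hip from every joint
```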
(2) A skeleton spatio-temporal feature map and a joint motion velocity map are constructed from the converted coordinate information.
Step (2.1): The relative coordinates between joints and the absolute joint coordinates are jointly encoded into a colour texture map to form the skeleton spatio-temporal feature map representing the spatio-temporal characteristics of the action, as follows:

$q_{j\_i}^f = \hat{p}_j^f - \hat{p}_i^f$

where $q_{j\_i}^f$ denotes the three-dimensional coordinate of the j-th joint relative to the i-th joint in the f-th frame, which also represents the spatial information of the bone connecting joints j and i. Further, when i = 1, $q_{j\_1}^f$ is the absolute coordinate of the j-th joint, i.e. $q_{j\_1}^f = \hat{p}_j^f$.
On this basis, the spatio-temporal feature of the j-th joint can be represented by the matrix $Q_{j\_i} = [q_{j\_i}^1, q_{j\_i}^2, \ldots, q_{j\_i}^F]$.
In the invention, only the first- and second-level related information (i.e., joint pairs connected by only one or two edges) with higher correlation is selected, which reduces the computational complexity, reduces inter-class confusion and improves intra-class robustness. The first- and second-level related information is given respectively by

$R_1=[Q_{h\_k},Q_{j\_i},\ldots,Q_{m\_n}], \quad R_2=[Q_{p\_o},Q_{u\_v},\ldots,Q_{y\_x}]$ (21)

where (h, k), (j, i), (m, n), etc. denote joint pairs connected by exactly one edge, such as the left wrist and left elbow or the left ankle and left knee, and (p, o), (u, v), (y, x), etc. denote joint pairs connected by two edges, such as the left wrist and shoulder or the left foot and knee.
Because the receptive field of a CNN grows with the network depth, the spatial information between highly correlated joint pairs should be extracted in shallow layers, while the spatial information of weakly correlated pairs is acquired in deeper layers. The joint distance map, shown in (a1) of FIG. 5, arranges the joint information in the colour image in a fixed order and ignores the differences in relative spatial information. Instead, the coordinate information is arranged according to the body structure, and all joints are divided into the following five groups: left arm, right arm, left leg, right leg and torso, each group arranged in the order of the physical connections between joints, as shown in FIG. 4. Taking the right arm as an example, the joint points [25, 24, 12, 11, 10, 9] are adjacent in FIG. 4 and therefore highly correlated, so grouping them together allows the spatial relations between them to be extracted more effectively. On this basis, the resulting skeleton spatio-temporal feature map effectively encodes the spatio-temporal information of the joints, as shown in (a2) of FIG. 5.
The skeleton spatio-temporal feature map obtained from the encoded skeleton sequence is

$E_k = [A, R_1, R_2]$

where k is the action class and A contains the absolute coordinates of the joint points. Letting the three-dimensional coordinates correspond to the R, G and B channels respectively converts the skeleton spatio-temporal feature $E_k$ into a 72×F skeleton spatio-temporal feature map.
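The sketch below illustrates how such a skeleton spatio-temporal feature map could be assembled, assuming body-frame coordinates of shape (F, N, 3) and caller-supplied lists of first- and second-level joint pairs already ordered by body part; the pair lists, the row layout and the min-max colour scaling are assumptions of the sketch, so the row count depends on the pairs chosen.

```python
import numpy as np

def skeleton_spatiotemporal_map(body_coords, first_level_pairs, second_level_pairs):
    """Stack absolute coordinates (A) and first/second-level relative coordinates
    (R1, R2) as image rows, with frames as columns and x, y, z mapped to R, G, B.

    body_coords: (F, N, 3) joint coordinates in the body coordinate system.
    first_level_pairs / second_level_pairs: lists of (j, i) joint-index pairs.
    """
    rows = [body_coords[:, j, :] for j in range(body_coords.shape[1])]                    # A
    rows += [body_coords[:, j, :] - body_coords[:, i, :] for j, i in first_level_pairs]   # R1
    rows += [body_coords[:, j, :] - body_coords[:, i, :] for j, i in second_level_pairs]  # R2
    sstm = np.stack(rows, axis=0)                                # (rows, F, 3)
    lo, hi = sstm.min(), sstm.max()
    return np.uint8(255.0 * (sstm - lo) / (hi - lo + 1e-8))      # 8-bit colour texture map
```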
Step (2.2): The velocity information of each joint is extracted to represent the dynamic characteristics of the motion, and a feature descriptor representing the joint motion characteristics is constructed from the velocity scalar information. The velocity values of a joint along the x, y and z directions in frame f can be expressed as

$v_x = \dfrac{x^{f+\Delta f}-x^{f}}{\Delta t}, \quad v_y = \dfrac{y^{f+\Delta f}-y^{f}}{\Delta t}, \quad v_z = \dfrac{z^{f+\Delta f}-z^{f}}{\Delta t}$

where $(x^{f+\Delta f}, y^{f+\Delta f}, z^{f+\Delta f})$ is the three-dimensional coordinate of the joint in frame f+Δf and Δf is the time step, with

$\Delta t = \Delta f / \mathrm{FPS}$

where FPS is the frame rate of the Kinect camera used.
Letting $v_x$, $v_y$ and $v_z$ correspond to the R, G and B channels respectively, the joint motion information can be encoded as an N×(F−Δf) joint motion velocity map.
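A minimal sketch of this joint motion velocity map is given below; the 30 fps frame rate, the single-frame time step and the min-max colour scaling are assumptions of the sketch.

```python
import numpy as np

def joint_motion_velocity_map(body_coords, delta_f=1, fps=30.0):
    """Encode per-joint velocities as an (N, F - delta_f, 3) colour image,
    with v_x, v_y, v_z mapped to the R, G, B channels.

    body_coords: (F, N, 3) joint coordinates; delta_f: time step in frames;
    fps: camera frame rate (assumed value).
    """
    dt = delta_f / fps
    vel = (body_coords[delta_f:] - body_coords[:-delta_f]) / dt   # (F - delta_f, N, 3)
    vel = np.transpose(vel, (1, 0, 2))                            # joints as rows, frames as columns
    lo, hi = vel.min(), vel.max()
    return np.uint8(255.0 * (vel - lo) / (hi - lo + 1e-8))
```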
(3) The features of the skeleton spatio-temporal feature map and the joint motion velocity map are enhanced based on motion saliency and a morphological operator, respectively, which enlarges the inter-class differences between different actions and reduces the intra-class differences of the same action.
(3.1) The joint spatial-domain information with salient motion characteristics in the skeleton spatio-temporal feature map is enhanced based on motion energy, specifically:
During the k-th class action sequence, the instantaneous energy of joint i with coordinate $\hat{p}_i^f$ in the f-th frame is

$e_i^{f} = \lVert \hat{p}_i^{f} - \hat{p}_i^{f-1} \rVert$

where f is greater than 1 and ‖·‖ denotes the Euclidean distance. From this, the motion energy of joint i over the whole action sequence is its accumulated instantaneous energy, min-max normalized as

$\varepsilon_i^{k} = \dfrac{\sum_{f=2}^{F} e_i^{f} - e_{\min}^{k}}{e_{\max}^{k} - e_{\min}^{k}}$

where $e_{\max}^{k}$ and $e_{\min}^{k}$ are the maximum and minimum values of the motion energy over all joints during the k-th class action sequence.
Following the coding order, the colour weights of all joints in the k-th class action, $\omega^{k} = [\varepsilon_1^{k}, \ldots, \varepsilon_N^{k}]$, are used as motion-enhancement weights, and the enhanced skeleton spatio-temporal feature map is obtained by scaling each joint's colour values in $E_k$ by its weight.
As shown in (b1) of FIG. 6, the colours corresponding to the joint-related information with high motion energy are enhanced, while the colour information of joints with low motion energy is blurred; adopting this adaptive enhancement gives the skeleton spatio-temporal feature map a motion-saliency characteristic and thereby improves the action classification capability.
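The sketch below illustrates one way to realise this motion-saliency enhancement; the per-joint accumulation, the min-max normalisation and the mapping from joints to image rows are assumptions of the sketch.

```python
import numpy as np

def motion_energy_weights(body_coords):
    """Accumulated per-joint displacement (instantaneous energies summed over the
    sequence), min-max normalised over all joints (normalisation form assumed)."""
    step = np.linalg.norm(np.diff(body_coords, axis=0), axis=2)   # (F-1, N) Euclidean step lengths
    energy = step.sum(axis=0)                                     # accumulated motion energy per joint
    return (energy - energy.min()) / (energy.max() - energy.min() + 1e-8)

def enhance_sstm(sstm, weights, rows_per_joint):
    """Scale the colour of each joint's rows in the SSTM by its motion-energy weight.

    rows_per_joint: dict mapping each joint index to its row indices in the map
    (the row layout is an assumption of this sketch).
    """
    out = sstm.astype(np.float64)
    for joint, rows in rows_per_joint.items():
        out[rows] *= weights[joint]
    return np.uint8(np.clip(out, 0, 255))
```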
(3.2) The texture information of the motion feature map is enhanced with a morphological operator to improve the velocity-estimation performance. An erosion operation is first applied to the joint motion velocity map to remove noise, namely

$X \ominus E = \{x \mid E_x \subseteq X\}$

where X is a binary image, ⊖ denotes the erosion operation and E is the structuring element.
Applying this erosion operation to the $v_x$, $v_y$ and $v_z$ obtained in step (2.2) gives

$I_v = [\,v_x \ominus E \;\; v_y \ominus E \;\; v_z \ominus E\,]$ (30)

where $I_v$ denotes the joint motion velocity map after erosion.
A dilation operation is then applied to the eroded image to restore and smooth the original texture, which effectively reduces the intra-class velocity differences:

$J_v = I_v \oplus E$

where $J_v$ denotes the joint motion velocity map after erosion and dilation, ⊖ denotes the erosion operation and ⊕ denotes the dilation operation.
As shown in (b2) of FIG. 6, the texture of the enhanced image (second row) is smoother than that of the original image (first row); with the original texture essentially preserved, useless information is effectively removed, which reduces the differences within similar actions.
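A minimal sketch of this erosion-then-dilation enhancement (a morphological opening) using OpenCV is shown below; the 3×3 structuring element is an assumption of the sketch.

```python
import cv2
import numpy as np

def enhance_velocity_map(jmsm, kernel_size=3):
    """Erode each velocity channel to suppress noise, then dilate to restore
    and smooth the texture (I_v = erosion of v_x, v_y, v_z; J_v = dilation of I_v)."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)    # structuring element E
    channels = cv2.split(jmsm)                                # v_x, v_y, v_z channels
    eroded = [cv2.erode(c, kernel) for c in channels]         # I_v
    dilated = [cv2.dilate(c, kernel) for c in eroded]         # J_v
    return cv2.merge(dilated)
```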
(4) Action classification is achieved by deeply fusing the enhanced skeleton spatio-temporal feature map and joint motion velocity map with the dual-stream convolutional neural network.
The dual-stream convolutional neural network model is composed of two improved AlexNets; as shown in FIG. 7, the numbers of neurons (convolution kernels) in the first, third and fourth layers of AlexNet are changed from 96, 384 and 384 to 64, 256 and 256, respectively, forming the dual-stream convolutional neural network model of the invention.
The skeleton spatio-temporal feature map and the joint motion velocity map are taken as the inputs of the static and dynamic streams, respectively; after processing by the convolutional, pooling and fully connected layers, the posterior probabilities produced by the single-stream CNNs are fused into the final recognition result.
Given a skeleton sequence $S_m$, the skeleton spatio-temporal feature map and the joint motion velocity map are obtained by the above processing and scaled to 227×227 pixels by bilinear interpolation to facilitate the subsequent deep-feature extraction. The deep features extracted by the CNN are output to the last fully connected layer and then normalized by a Softmax function, yielding the posterior probability

$P(n \mid x) = \dfrac{\exp(z_n)}{\sum_{i=1}^{N}\exp(z_i)}$

where $P(n \mid x)$ is the probability that the image x of the m-th skeleton sequence belongs to the n-th action class, $z_n$ is the input of the n-th neuron of the last fully connected layer, x denotes the skeleton spatio-temporal feature map or the joint motion velocity map, and N is the number of action classes.
In the proposed model, for each class n the two streams output $P_{SSTM}(n \mid x)$ and $P_{JMSM}(n \mid x)$, and multiplicative fusion is applied to the stream outputs to obtain the final classification result:

ActionClass = Fin(Max(P_SSTM ⊙ P_JMSM)) (34)

where Fin(·) is the maximum-label function, Max(·) is the maximum operator, ⊙ is the Hadamard product operator, SSTM denotes the skeleton spatio-temporal feature map, JMSM denotes the joint motion velocity map, $P_{SSTM}$ is the softmax output value of the static stream and $P_{JMSM}$ is the softmax output value of the dynamic stream, each computed with the Softmax expression above for the corresponding stream.
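A sketch of one stream and of the multiplicative fusion in PyTorch is given below. The 64/256/256 kernel counts in the first, third and fourth convolution layers follow the modification described above, while the kernel sizes, strides, pooling layout and fully connected sizes follow standard AlexNet and are assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModifiedAlexNet(nn.Module):
    """One stream: an AlexNet-style CNN with 64, 256 and 256 kernels in the
    first, third and fourth convolution layers (instead of 96, 384, 384)."""
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):                      # x: (batch, 3, 227, 227)
        return self.classifier(self.features(x))
```

One instance of this network would serve as the static stream (fed the skeleton spatio-temporal feature maps) and another as the dynamic stream (fed the joint motion velocity maps). The multiplicative (Hadamard) fusion of the two stream posteriors can then be sketched as follows.

```python
def fuse_and_classify(static_net, dynamic_net, sstm_batch, jmsm_batch):
    """Compute each stream's softmax posterior, take their element-wise product,
    and return the class with the largest fused score."""
    p_sstm = F.softmax(static_net(sstm_batch), dim=1)    # static-stream posterior P_SSTM
    p_jmsm = F.softmax(dynamic_net(jmsm_batch), dim=1)   # dynamic-stream posterior P_JMSM
    fused = p_sstm * p_jmsm                              # Hadamard product
    return fused.argmax(dim=1)                           # ActionClass = Fin(Max(...))
```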
the invention relates to a double-current convolution neural network action identification method based on skeleton space-time and dynamic information, which comprises the steps of firstly transforming a skeleton three-dimensional coordinate system to obtain coordinate information containing relative positions of joints; secondly, coding the related information among joints into a color texture map to construct a skeleton space-time feature descriptor, and considering the physical structure constraint of a human body to increase the difference among classes; then, estimating the velocity information of each joint, and coding the velocity information into a color texture map to obtain a skeleton motion characteristic descriptor; in addition, the obtained space-time and dynamic characteristics are respectively enhanced based on the motion significance and the morphological operator so as to further improve the characteristic expression capability; and finally, the enhanced bone space-time and dynamic characteristics are deeply fused through a double-flow convolutional neural network to realize action recognition. Aiming at complex scenes with visual angle change, rich noise, subtle difference action and the like, the method can effectively improve the action recognition accuracy.
The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any equivalent substitution or modification of the technical solution and inventive concept of the present invention that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention.
Claims (6)
1. A dual-stream convolutional neural network action recognition method based on skeleton spatio-temporal and dynamic information, characterized by comprising the following steps:
(1) inputting a skeleton sequence and performing a coordinate-system conversion on the obtained skeleton sequence;
(2) constructing a skeleton spatio-temporal feature map and a joint motion velocity map from the converted coordinate information, specifically:
(2.1) encoding the relative joint coordinates and the absolute joint coordinates into a skeleton spatio-temporal feature map under human-body structure constraints;
(2.2) encoding the joint velocity information at the same time step into a joint motion velocity map;
(3) enhancing the features of the skeleton spatio-temporal feature map and the joint motion velocity map based on motion saliency and a morphological operator, respectively;
(4) classifying the action by deeply fusing the enhanced skeleton spatio-temporal feature map and joint motion velocity map with the dual-stream convolutional neural network.
2. The dual-stream convolutional neural network action recognition method based on skeleton spatio-temporal and dynamic information according to claim 1, characterized in that step (1) is specifically as follows:
the skeleton sequences captured by the depth sensor lie in a Cartesian coordinate system with the camera as the origin, and the three-dimensional skeleton coordinates are converted to a body coordinate system that effectively represents the spatial-domain information, as follows:
a body coordinate system is constructed with the hip joint, whose motion amplitude is small, as its origin, and for a video sequence with N joint points and F frames the joint coordinates are converted as $\hat{p}_j^f = p_j^f - p_{hip}^f$, where $p_j^f$ and $\hat{p}_j^f$ are the coordinates of joint j in the f-th frame before and after the transformation and $p_{hip}^f$ is the coordinate of the hip joint in the f-th frame.
3. The dual-stream convolutional neural network action recognition method based on skeleton spatio-temporal and dynamic information according to claim 1, characterized in that step (2) is specifically as follows:
in step (2.1), the absolute joint coordinates and the relative coordinates between joints are jointly encoded into a colour texture map to form the skeleton spatio-temporal feature map representing the spatio-temporal characteristics of the action, as follows:
$q_{j\_i}^f = \hat{p}_j^f - \hat{p}_i^f$
where $q_{j\_i}^f$ is the three-dimensional coordinate of the j-th joint relative to the i-th joint in the f-th frame and represents the spatial information of the bone connecting joints j and i; when i = 1, $q_{j\_1}^f$ is the absolute coordinate of the j-th joint;
the spatio-temporal feature of the j-th joint is represented by the matrix $Q_{j\_i} = [q_{j\_i}^1, q_{j\_i}^2, \ldots, q_{j\_i}^F]$;
only the first- and second-level related information with higher correlation is selected, given respectively by
$R_1=[Q_{h\_k},Q_{j\_i},\ldots,Q_{m\_n}], \quad R_2=[Q_{p\_o},Q_{u\_v},\ldots,Q_{y\_x}]$ (4)
where (h, k), (j, i) and (m, n) denote joint pairs connected by exactly one edge, and (p, o), (u, v) and (y, x) denote joint pairs connected by two edges;
the coordinate information is arranged according to the body structure, all joints of the body being divided into the following five groups: left arm, right arm, left leg, right leg and torso, each group arranged in the order of the physical connections between joints, and the skeleton spatio-temporal feature map obtained with this coding order is
$E_k = [A, R_1, R_2]$
where k is the action class and A contains the absolute joint coordinates; letting the three-dimensional coordinates correspond to the R, G and B channels respectively converts the skeleton spatio-temporal feature $E_k$ into a 72×F skeleton spatio-temporal feature map;
in step (2.2), the velocity information of each joint is extracted to represent the dynamic characteristics of the motion, and a feature descriptor representing the joint motion characteristics is constructed from the velocity scalar information, the velocity values of a joint along the x, y and z directions in frame f being
$v_x = \dfrac{x^{f+\Delta f}-x^{f}}{\Delta t}, \quad v_y = \dfrac{y^{f+\Delta f}-y^{f}}{\Delta t}, \quad v_z = \dfrac{z^{f+\Delta f}-z^{f}}{\Delta t}$
where $(x^{f+\Delta f}, y^{f+\Delta f}, z^{f+\Delta f})$ is the three-dimensional coordinate of the joint in frame f+Δf, Δf is the time step, and $\Delta t = \Delta f / \mathrm{FPS}$, where FPS is the frame rate of the camera used;
letting $v_x$, $v_y$ and $v_z$ correspond to the R, G and B channels respectively, the joint motion information is encoded as an N×(F−Δf) joint motion velocity map.
4. The dual-stream convolutional neural network action recognition method based on skeleton spatio-temporal and dynamic information according to claim 3, characterized in that step (3) is specifically as follows:
(3.1) the joint spatial-domain information with salient motion characteristics in the skeleton spatio-temporal feature map is enhanced based on motion energy, specifically:
during the k-th class action sequence, the instantaneous energy of joint i with coordinate $\hat{p}_i^f$ in the f-th frame is
$e_i^{f} = \lVert \hat{p}_i^{f} - \hat{p}_i^{f-1} \rVert$
where f is greater than 1 and ‖·‖ denotes the Euclidean distance, and the motion energy of joint i over the whole action sequence is its accumulated instantaneous energy, min-max normalized as
$\varepsilon_i^{k} = \dfrac{\sum_{f=2}^{F} e_i^{f} - e_{\min}^{k}}{e_{\max}^{k} - e_{\min}^{k}}$
where $e_{\max}^{k}$ and $e_{\min}^{k}$ are respectively the maximum and minimum values of the motion energy over all joints during the k-th class action sequence;
following the coding order, the colour weights of all joints in the k-th class action, $\omega^{k}$, are used as motion-enhancement weights, and the enhanced skeleton spatio-temporal feature map is obtained by scaling each joint's colour values in $E_k$ by its weight;
(3.2) the texture information of the joint motion velocity map is enhanced with a morphological operator to improve the velocity-estimation performance, as follows:
an erosion operation is first applied to the joint motion velocity map to remove noise:
$X \ominus E = \{x \mid E_x \subseteq X\}$ (12)
where X is a binary image, ⊖ denotes the erosion operation and E is the structuring element; applying formula (12) to the velocity values $v_x$, $v_y$ and $v_z$ of the joints along the x, y and z directions in frame f obtained in step (2.2) gives
$I_v = [\,v_x \ominus E \;\; v_y \ominus E \;\; v_z \ominus E\,]$ (13)
where $I_v$ denotes the joint motion velocity map after erosion;
a dilation operation is then applied to the eroded image:
$J_v = I_v \oplus E$
where $J_v$ denotes the joint motion velocity map after erosion and dilation, ⊖ denotes the erosion operation and ⊕ denotes the dilation operation.
5. The dual-stream convolutional neural network action recognition method based on skeleton spatio-temporal and dynamic information according to claim 1, characterized in that step (4) is specifically as follows:
the dual-stream convolutional neural network model is based on the AlexNet model, with the numbers of neurons in its first, third and fourth layers set to 64, 256 and 256, respectively; the skeleton spatio-temporal feature map and the joint motion velocity map are taken as the inputs of the static and dynamic streams, respectively, and after processing by the convolutional, pooling and fully connected layers, the posterior probabilities produced by the single-stream CNNs are fused into the final recognition result.
6. The dual-stream convolutional neural network action recognition method based on skeleton spatio-temporal and dynamic information according to claim 4, characterized in that the skeleton spatio-temporal feature map and the joint motion velocity map are taken as the inputs of the static and dynamic streams, respectively, and after processing by the convolutional, pooling and fully connected layers, the posterior probabilities produced by the single-stream CNNs are fused into the final recognition result, as follows:
given a skeleton sequence $S_m$, the skeleton spatio-temporal feature map and the joint motion velocity map are obtained by the above processing and scaled to 227×227 pixels by bilinear interpolation to facilitate the subsequent deep-feature extraction; the deep features extracted by the CNN are fed to the last fully connected layer and then normalized by a Softmax function to obtain the posterior probability
$P(n \mid x) = \dfrac{\exp(z_n)}{\sum_{i=1}^{N}\exp(z_i)}$
where $P(n \mid x)$ is the probability that the image x of the m-th skeleton sequence belongs to the n-th action class, $z_n$ is the input of the n-th neuron of the last fully connected layer, x denotes the skeleton spatio-temporal feature map or the joint motion velocity map, and N is the number of action classes;
for each class n, the dual-stream convolutional neural network model outputs $P_{SSTM}(n \mid x)$ and $P_{JMSM}(n \mid x)$, and multiplicative fusion is applied to the stream outputs to obtain the final classification result:
ActionClass = Fin(Max(P_SSTM ⊙ P_JMSM)) (16)
where Fin(·) is the maximum-label function, Max(·) is the maximum operator, ⊙ is the Hadamard product operator, SSTM denotes the skeleton spatio-temporal feature map, JMSM denotes the joint motion velocity map, $P_{SSTM}$ is the softmax output value of the static stream and $P_{JMSM}$ is the softmax output value of the dynamic stream, each computed with the Softmax expression above for the corresponding stream.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010539760.5A CN111695523B (en) | 2020-06-15 | 2020-06-15 | Double-flow convolutional neural network action recognition method based on skeleton space-time and dynamic information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010539760.5A CN111695523B (en) | 2020-06-15 | 2020-06-15 | Double-flow convolutional neural network action recognition method based on skeleton space-time and dynamic information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111695523A true CN111695523A (en) | 2020-09-22 |
CN111695523B CN111695523B (en) | 2023-09-26 |
Family
ID=72480940
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010539760.5A Active CN111695523B (en) | 2020-06-15 | 2020-06-15 | Double-flow convolutional neural network action recognition method based on skeleton space-time and dynamic information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111695523B (en) |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104038738A (en) * | 2014-06-04 | 2014-09-10 | 东北大学 | Intelligent monitoring system and intelligent monitoring method for extracting coordinates of human body joint |
CN105787439A (en) * | 2016-02-04 | 2016-07-20 | 广州新节奏智能科技有限公司 | Depth image human body joint positioning method based on convolution nerve network |
US20170347107A1 (en) * | 2016-05-26 | 2017-11-30 | Mstar Semiconductor, Inc. | Bit allocation method and video encoding device |
CN109460707A (en) * | 2018-10-08 | 2019-03-12 | 华南理工大学 | A kind of multi-modal action identification method based on deep neural network |
CN109670401A (en) * | 2018-11-15 | 2019-04-23 | 天津大学 | A kind of action identification method based on skeleton motion figure |
CN109919122A (en) * | 2019-03-18 | 2019-06-21 | 中国石油大学(华东) | A kind of timing behavioral value method based on 3D human body key point |
CN110188599A (en) * | 2019-04-12 | 2019-08-30 | 哈工大机器人义乌人工智能研究院 | A kind of human body attitude behavior intellectual analysis recognition methods |
CN110059662A (en) * | 2019-04-26 | 2019-07-26 | 山东大学 | A kind of deep video Activity recognition method and system |
CN110222568A (en) * | 2019-05-05 | 2019-09-10 | 暨南大学 | A kind of across visual angle gait recognition method based on space-time diagram |
CN110253583A (en) * | 2019-07-02 | 2019-09-20 | 北京科技大学 | The human body attitude robot teaching method and device of video is taken based on wearing teaching |
CN110929637A (en) * | 2019-11-20 | 2020-03-27 | 中国科学院上海微系统与信息技术研究所 | Image identification method and device, electronic equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
WU Zhenzhen; DENG Huifang: "3D Human Action Recognition Using Skeleton Models and Grassmann Manifolds", Computer Engineering and Applications *
ZHENG Xiao; PENG Xiaodong; WANG Jiaxuan: "Human Action Recognition Method Based on Pose Spatio-Temporal Features" *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112270246A (en) * | 2020-10-23 | 2021-01-26 | 泰康保险集团股份有限公司 | Video behavior identification method and device, storage medium and electronic equipment |
CN112270246B (en) * | 2020-10-23 | 2024-01-05 | 泰康保险集团股份有限公司 | Video behavior recognition method and device, storage medium and electronic equipment |
CN112861808A (en) * | 2021-03-19 | 2021-05-28 | 泰康保险集团股份有限公司 | Dynamic gesture recognition method and device, computer equipment and readable storage medium |
CN112861808B (en) * | 2021-03-19 | 2024-01-23 | 泰康保险集团股份有限公司 | Dynamic gesture recognition method, device, computer equipment and readable storage medium |
CN113011381A (en) * | 2021-04-09 | 2021-06-22 | 中国科学技术大学 | Double-person motion identification method based on skeleton joint data |
US11854305B2 (en) | 2021-05-09 | 2023-12-26 | International Business Machines Corporation | Skeleton-based action recognition using bi-directional spatial-temporal transformer |
CN114943987A (en) * | 2022-06-07 | 2022-08-26 | 首都体育学院 | Motion behavior knowledge graph construction method adopting PAMS motion coding |
Also Published As
Publication number | Publication date |
---|---|
CN111695523B (en) | 2023-09-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |