
CN111695523A - Two-stream convolutional neural network action recognition method based on skeleton spatio-temporal and dynamic information - Google Patents

Two-stream convolutional neural network action recognition method based on skeleton spatio-temporal and dynamic information Download PDF

Info

Publication number
CN111695523A
Authority
CN
China
Prior art keywords
joint
motion
space
skeleton
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010539760.5A
Other languages
Chinese (zh)
Other versions
CN111695523B (en)
Inventor
王洪雁
张鼎卓
袁海
汪祖民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202010539760.5A priority Critical patent/CN111695523B/en
Publication of CN111695523A publication Critical patent/CN111695523A/en
Application granted granted Critical
Publication of CN111695523B publication Critical patent/CN111695523B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/60Rotation of whole images or parts thereof
    • G06T3/604Rotation of whole images or parts thereof using coordinate rotation digital computer [CORDIC] devices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/20Image enhancement or restoration using local operators
    • G06T5/30Erosion or dilatation, e.g. thinning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/30Noise filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a two-stream convolutional neural network action recognition method based on skeleton spatio-temporal and dynamic information, which belongs to the field of computer-vision image and video processing and addresses the low recognition rate of skeleton-based action recognition methods in complex scenes. Its key steps are: (1) input a skeleton sequence and convert its coordinate system; (2) construct a skeleton spatio-temporal feature map and a joint motion velocity map from the converted coordinate information; (3) enhance the skeleton spatio-temporal feature map and the joint motion velocity map based on motion saliency and morphological operators, respectively; (4) classify actions by deeply fusing the enhanced skeleton spatio-temporal feature map and joint motion velocity map with a two-stream convolutional neural network. The effect is that, for complex scenes with viewpoint changes, heavy noise, subtly different actions and the like, the action recognition accuracy can be effectively improved.

Description

Two-stream convolutional neural network action recognition method based on skeleton spatio-temporal and dynamic information
Technical Field
The invention belongs to the field of computer-vision image and video processing, and relates to an action recognition method based on skeleton spatio-temporal and dynamic features combined with a two-stream convolutional neural network (Two-Stream CNN, TS-CNN).
Background
As a research hotspot in computer vision, human action recognition has important application value in intelligent surveillance, human-computer interaction, video retrieval and related fields. It still faces several technical difficulties. Methods based on RGB images are not robust to factors such as illumination changes and cluttered backgrounds. Depth images contain highly redundant information, which increases the computational complexity of the algorithms and limits their practical application. Because the raw skeleton data captured by depth sensors contain noise and the spatio-temporal relations between joints are ambiguous, effectively extracting motion information from three-dimensional skeleton data to recognize human actions remains a major challenge. Methods based on hand-crafted features extract only a single type of feature, so their recognition accuracy is limited and their generality is poor. Exploiting the good temporal modeling capability of RNNs, action recognition models have been built with RNNs; however, RNNs cannot effectively express the spatial-domain relations between joints. Exploiting the strong spatial feature extraction capability of CNNs, action features can be extracted from images that encode the skeleton sequence. When encoding a skeleton sequence into a color texture map, conventional methods mainly suffer from the following problems: first, each joint is encoded into the color image independently, so the correlated information between joints is ignored; second, the spatial constraints between joints are ignored, so the joint spatial-domain information becomes disordered and the recognition accuracy is limited; finally, only the static characteristics of the joints are considered, while the dynamic characteristics of the joints and the different degrees to which joints participate in completing an action are neglected, so the motion information is encoded incompletely, the joint spatial saliency information is lost, and the action recognition rate is limited.
Disclosure of Invention
In order to solve the above problems, the invention provides a two-stream convolutional neural network action recognition method based on skeleton spatio-temporal and dynamic information, which addresses the low recognition rate of skeleton-based action recognition methods in complex scenes.
The invention adopts the following technical scheme. A two-stream convolutional neural network action recognition method based on skeleton spatio-temporal and dynamic information comprises the following steps:
(1) Input a skeleton sequence and convert the coordinate system of the obtained skeleton sequence.
(2) Construct a skeleton spatio-temporal feature map and a joint motion velocity map from the converted coordinate information, specifically:
(2.1) Encode the relative and absolute joint coordinates into a skeleton spatio-temporal feature map under human body structure constraints.
(2.2) Encode the joint velocity information at the same time step into a joint motion velocity map.
(3) Enhance the skeleton spatio-temporal feature map and the joint motion velocity map based on motion saliency and morphological operators, respectively.
(4) Classify actions by deeply fusing the enhanced skeleton spatio-temporal feature map and joint motion velocity map with a two-stream convolutional neural network.
Further, step (1) is specifically as follows:
The skeleton sequences captured by the depth sensor are located in a Cartesian coordinate system whose origin is the camera. To obtain a body coordinate system that effectively represents spatial-domain information, the three-dimensional skeleton coordinates are converted as follows:
A body coordinate system is constructed with the hip joint, whose motion amplitude is small, as the origin. For a video sequence with N joint points and F frames, the joint coordinates are converted as
$\hat{p}_j^f = p_j^f - p_1^f$  (1)
where $p_j^f$ and $\hat{p}_j^f$ are the coordinates of joint $j$ in frame $f$ before and after the coordinate transformation, respectively, and $p_1^f$ is the coordinate of the hip joint (joint 1) in frame $f$.
Further, step (2) is specifically as follows:
In step (2.1), the absolute joint coordinates and the relative coordinates between joints are jointly encoded into a color texture map to form a skeleton spatio-temporal feature map that represents the spatio-temporal characteristics of the action, as follows:
Based on the coordinate-transformed skeleton sequence $\hat{S} = \{\hat{p}_j^f \mid j = 1, \dots, N;\ f = 1, \dots, F\}$, the relative joint positions are obtained by
$p_{j\_i}^f = \hat{p}_j^f - \hat{p}_i^f$  (2)
where $p_{j\_i}^f$ is the three-dimensional coordinate of joint $j$ relative to joint $i$ in frame $f$ and represents the spatial information of the bone connecting joints $j$ and $i$; when $i = 1$, $p_{j\_1}^f$ is the absolute coordinate of joint $j$, i.e. $p_{j\_1}^f = \hat{p}_j^f$.
The spatio-temporal feature of joint $j$ is then represented by the matrix $Q_{j\_i}$:
$Q_{j\_i} = [p_{j\_i}^1, p_{j\_i}^2, \dots, p_{j\_i}^F]$  (3)
Only the first- and second-level relative information, which has the highest correlation, is selected; the two levels are respectively given by
$R_1 = [Q_{h\_k}, Q_{j\_i}, \dots, Q_{m\_n}],\quad R_2 = [Q_{p\_o}, Q_{u\_v}, \dots, Q_{y\_x}]$  (4)
where $(h, k)$, $(j, i)$ and $(m, n)$ denote joint pairs connected by exactly one edge, and $(p, o)$, $(u, v)$ and $(y, x)$ denote joint pairs connected by two edges.
The coordinate information is arranged according to the body structure: all joints of the body are divided into five groups (left arm, right arm, left leg, right leg, torso), and each group is ordered by the physical connections between joints. The skeleton spatio-temporal feature obtained with this encoding order is
$E_k = [A, R_1, R_2]$
where $k$ is the action category and $A = [Q_{1\_1}, Q_{2\_1}, \dots, Q_{N\_1}]$ collects the absolute joint coordinates. Letting the three coordinate components correspond to the R, G and B channels respectively, the skeleton spatio-temporal feature $E_k$ is converted into a 72 × F skeleton spatio-temporal feature map.
In step (2.2), the velocity information of each joint is extracted to represent the dynamic characteristics of the motion, and a feature descriptor of the joint motion characteristics is constructed from the velocity scalar information. The velocity components of a joint along x, y and z in frame $f$ are
$v_x = \dfrac{x^{f+\Delta f} - x^f}{\Delta t},\quad v_y = \dfrac{y^{f+\Delta f} - y^f}{\Delta t},\quad v_z = \dfrac{z^{f+\Delta f} - z^f}{\Delta t}$
where $(x^{f+\Delta f}, y^{f+\Delta f}, z^{f+\Delta f})$ is the three-dimensional coordinate of the joint in frame $f + \Delta f$, $\Delta f$ is the time step, and
$\Delta t = \dfrac{\Delta f}{\mathrm{FPS}}$
where FPS is the frame rate of the camera.
Mapping $v_x$, $v_y$ and $v_z$ to the R, G and B channels respectively, the encoded joint motion information forms an N × (F − Δf) joint motion velocity map.
Further, step (3) is specifically as follows:
(3.1) The spatial information of joints with salient motion characteristics in the skeleton spatio-temporal feature map is enhanced based on motion energy, as follows:
During a class-k action sequence, for the joint $i$ with coordinates $\hat{p}_i^f$, the instantaneous energy in frame $f$ is
$e_i^f = \lVert \hat{p}_i^f - \hat{p}_i^{f-1} \rVert$
where $f > 1$ and $\lVert \cdot \rVert$ denotes the Euclidean distance. The motion energy of joint $i$ over the whole action sequence is
$E_i^k = \sum_{f=2}^{F} e_i^f$
Based on the motion energy $E_i^k$, the color weight $\omega_i^k$ of the $i$-th joint can be obtained by
$\omega_i^k = \dfrac{E_i^k - E_{\min}^k}{E_{\max}^k - E_{\min}^k}$
where $E_{\max}^k$ and $E_{\min}^k$ are the maximum and minimum motion energies over all joints during the class-k action sequence.
Following the encoding order, the color weights of all joints in the k-th action are encoded as the motion-enhancement weight
$\Omega^k = [\omega_1^k, \omega_2^k, \dots, \omega_N^k]$
and the enhanced skeleton spatio-temporal feature map is represented as
$\hat{E}_k = \Omega^k \odot E_k$
(3.2) The texture information of the joint motion velocity map is enhanced with morphological operators to improve the quality of the velocity estimate, as follows:
First, an erosion operation is applied to the joint motion velocity map to remove noise:
$X \ominus E = \{ z \mid (E)_z \subseteq X \}$  (12)
where $X$ is a binary image, $\ominus$ denotes the erosion operation and $E$ is the structuring element. Formula (12) is applied to the velocity components $v_x$, $v_y$ and $v_z$ of the joints in frame $f$ obtained in step (2.2):
$I_v = [v_x \ominus E,\ v_y \ominus E,\ v_z \ominus E]$  (13)
where $I_v$ denotes the joint motion velocity map after erosion.
A dilation operation is then applied to the eroded image:
$J_v = I_v \oplus E = [(v_x \ominus E) \oplus E,\ (v_y \ominus E) \oplus E,\ (v_z \ominus E) \oplus E]$  (14)
where $J_v$ denotes the joint motion velocity map after erosion and dilation, $\ominus$ denotes the erosion operation and $\oplus$ denotes the dilation operation.
Further, step (4) is specifically as follows:
The two-stream convolutional neural network model is based on the AlexNet model, with the numbers of neurons in its first, third and fourth convolutional layers set to 64, 256 and 256, respectively. The skeleton spatio-temporal feature map and the joint motion velocity map are used as the inputs of the static and dynamic streams, respectively; after processing by the convolutional, pooling and fully connected layers, the posterior probabilities produced by the single-stream CNNs are fused into the final recognition result.
Further, the skeleton spatio-temporal feature map and the joint motion velocity map are used as the inputs of the static and dynamic streams, respectively, and after processing by the convolutional, pooling and fully connected layers, the posterior probabilities produced by the single-stream CNNs are fused into the final recognition result, as follows:
A given skeleton sequence $S_m$ is processed to obtain its skeleton spatio-temporal feature map and joint motion velocity map, and both are scaled to 227 × 227 pixels by bilinear interpolation to facilitate the subsequent deep-feature extraction. The deep features extracted by each CNN are passed to the last fully connected layer and then normalized by a Softmax function to obtain the posterior probabilities
$p_n^m(x) = \dfrac{\exp(a_n)}{\sum_{i=1}^{N} \exp(a_i)}$  (15)
where $p_n^m(x)$ is the probability that the image $x$ of the m-th skeleton sequence belongs to the n-th action class, $a_n$ is the input of the n-th neuron of the last fully connected layer, $x$ denotes a skeleton spatio-temporal feature map or a joint motion velocity map, and $N$ is the number of action classes.
For each input, the two-stream convolutional neural network model outputs the class-probability vectors $P_{SSTM}$ and $P_{JMSM}$, and multiplicative fusion is applied to the two stream outputs to obtain the final classification result:
$ActionClass = Fin(Max(P_{SSTM} \odot P_{JMSM}))$  (16)
where $Fin(\cdot)$ is the maximum-label function, $Max(\cdot)$ is the maximum operator, $\odot$ is the Hadamard product operator, SSTM denotes the skeleton spatio-temporal feature map, JMSM denotes the joint motion velocity map, $P_{SSTM}$ is the softmax output of the static stream and $P_{JMSM}$ is the softmax output of the dynamic stream, given respectively by
$P_{SSTM} = [p_1^{SSTM}, p_2^{SSTM}, \dots, p_N^{SSTM}],\quad P_{JMSM} = [p_1^{JMSM}, p_2^{JMSM}, \dots, p_N^{JMSM}]$
has the advantages that: the invention is based on the action recognition of space-time and dynamic characteristics, and transforms the coordinate system of each type of action; constructing descriptors of skeleton space-time characteristics and motion characteristics; joint space domain information with obvious movement characteristics in the skeleton space-time characteristic diagram is enhanced, and a joint motion velocity diagram is enhanced by using a morphological operator to eliminate noise; and realizing action classification based on the enhanced bone space-time characteristic diagram and the joint motion velocity diagram of the deep fusion of the double-flow convolutional neural network. In the invention, because a relatively stable joint is selected as a coordinate origin to transform a skeleton sequence coordinate system, the obtained body coordinate system can effectively represent the related information between joints, and a skeleton space-time characteristic diagram is constructed by using the related information; body structure constraint is added when a skeleton sequence is coded, so that the recognition rate among different types of actions is greatly improved; in addition, after the dynamic skeleton information is added, the motion characteristic information is more comprehensively represented, so that the overall recognition rate of the invention is obviously improved; and finally, the difference between similar actions is reduced by enhancing the motion significance, and the error recognition rate between similar actions is reduced. Compared with the mainstream human body action identification method, the method has higher identification rate under the complex scenes of visual angle change, noise, main body diversity, similar action diversity and the like.
Drawings
FIG. 1 is a schematic flow chart of the main framework of the method of the invention.
FIG. 2 shows the skeleton coordinates in the Kinect coordinate system.
FIG. 3 is a visualization of the joints in the body coordinate system.
FIG. 4 shows the set of 25 human joints.
FIG. 5 shows a joint distance map and the proposed skeleton spatio-temporal feature map: panel a1 is the joint distance map; panel a2 is the skeleton spatio-temporal feature map.
FIG. 6 shows the image-enhanced color texture maps: panel b1 is the motion enhancement of the skeleton spatio-temporal feature map; panel b2 is the visual enhancement of the joint motion velocity map.
FIG. 7 is the two-stream convolutional neural network model.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The flow of the two-stream convolutional neural network action recognition method based on skeleton spatio-temporal and dynamic information is shown in FIG. 1, and the implementation steps are as follows:
(1) Convert the coordinate system of the skeleton sequence to obtain a body coordinate system whose origin is the hip joint.
The skeleton sequences captured by a depth sensor such as the Kinect are located in a Cartesian coordinate system whose origin is the camera, as shown in FIG. 2. To obtain a body coordinate system that effectively represents spatial-domain information, the three-dimensional skeleton coordinates must be converted, specifically:
A body coordinate system is constructed with the hip joint, whose motion amplitude is small, as the origin. For a video sequence with N joint points and F frames, the joint coordinate transformation can be expressed as
$\hat{p}_j^f = p_j^f - p_1^f$  (18)
where $p_j^f$ and $\hat{p}_j^f$ are the coordinates of joint $j$ in frame $f$ before and after the coordinate transformation, respectively, and $p_1^f$ is the coordinate of the hip joint (joint 1) in frame $f$. The joints after the transformation are visualized in FIG. 3.
(2) Construct a skeleton spatio-temporal feature map and a joint motion velocity map from the converted coordinate information.
Step (2.1): the relative coordinates between joints and the absolute joint coordinates are jointly encoded into a color texture map to form a skeleton spatio-temporal feature map that represents the spatio-temporal characteristics of the action, as follows:
Based on the coordinate-transformed skeleton sequence $\hat{S} = \{\hat{p}_j^f \mid j = 1, \dots, N;\ f = 1, \dots, F\}$, the relative joint positions can be obtained by
$p_{j\_i}^f = \hat{p}_j^f - \hat{p}_i^f$  (19)
where $p_{j\_i}^f$ denotes the three-dimensional coordinate of joint $j$ relative to joint $i$ in frame $f$, which also represents the spatial information of the bone connecting joints $j$ and $i$. Furthermore, when $i = 1$, $p_{j\_1}^f$ is the absolute coordinate of joint $j$, i.e. $p_{j\_1}^f = \hat{p}_j^f$.
On this basis, the spatio-temporal feature of joint $j$ can be represented by the matrix $Q_{j\_i}$:
$Q_{j\_i} = [p_{j\_i}^1, p_{j\_i}^2, \dots, p_{j\_i}^F]$  (20)
in the invention, only the first and second (namely only one or two joint pairs with connected edges) level related information with higher correlation degree is selected, so that the calculation complexity is reduced, the inter-class confusion is reduced, and the intra-class robustness is improved. The first and second levels of relevant information are respectively shown as follows:
R1=[Qh_k,Qj_i,…,Qm_n],R2=[Qp_o,Qu_v,…,Qy_x](21)
wherein, h, k; j, i; m, n, etc. represent joint pairs connected by only one side, such as left wrist and left elbow, left ankle and left knee, p, o; u, v; y, x, etc. represent joint pairs connected by two sides, such as left wrist and shoulder, left foot and knee, etc.
Since the sensitivity area of CNN increases with the depth of the network, the spatial information between the joint pairs with high correlation should be extracted in a shallow layer, and the spatial information with low correlation should be acquired in a deep layer. The proposed joint distance map, as shown in a1 of fig. 5, arranges joint information in a color image in a fixed order while ignoring the difference in relative spatial information, arranges coordinate information in accordance with a body structure, and divides all joints into the following five groups: left arm, right arm, left leg, right leg, torso, each group is arranged according to the physical connection order between joints, as shown in fig. 4. Taking the right arm as an example, the joint points [25,24,12,11,10,9] are adjacent in fig. 4, so that the correlation degree is higher, and the spatial relationship between the joint points can be more effectively extracted by grouping the joint points into a group. Based on the above, the resulting bone space-time feature map can effectively encode the space-time domain information of the joint, as shown in a2 of fig. 5.
The skeleton spatio-temporal feature obtained by encoding the skeleton sequence in this order is
$E_k = [A, R_1, R_2]$
where $k$ is the action category and $A = [Q_{1\_1}, Q_{2\_1}, \dots, Q_{N\_1}]$ collects the absolute joint coordinates. Letting the three coordinate components correspond to the R, G and B channels respectively, the skeleton spatio-temporal feature $E_k$ can be converted into a 72 × F skeleton spatio-temporal feature map.
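The following sketch illustrates, under simplifying assumptions, how the joint coordinates can be stacked row by row (absolute coordinates followed by the selected first- and second-level relative coordinates) and mapped to the R, G, B channels to form the skeleton spatio-temporal feature map. The helper name, the pair lists passed in, and the min-max normalisation to [0, 255] are assumptions for the example, not the exact encoding of the embodiment.

```python
import numpy as np

def skeleton_spatiotemporal_map(skeleton, first_level_pairs, second_level_pairs):
    """Build a (rows, F, 3) color texture map from a body-frame skeleton.

    skeleton: (F, N, 3) joint coordinates in the body coordinate system,
              already ordered group by group (left arm, right arm,
              left leg, right leg, torso).
    first_level_pairs / second_level_pairs: lists of (j, i) joint index
              pairs connected by one edge / by two edges.
    """
    rows = []
    # absolute coordinates A (the hip is the origin after the transformation)
    for j in range(skeleton.shape[1]):
        rows.append(skeleton[:, j, :])
    # first- and second-level relative coordinates R1 and R2
    for j, i in list(first_level_pairs) + list(second_level_pairs):
        rows.append(skeleton[:, j, :] - skeleton[:, i, :])
    feature = np.stack(rows, axis=0)  # (rows, F, 3); 72 rows for 25 joints
    # min-max normalise so the x, y, z components become R, G, B intensities
    mins = feature.min(axis=(0, 1), keepdims=True)
    maxs = feature.max(axis=(0, 1), keepdims=True)
    return ((feature - mins) / (maxs - mins + 1e-8) * 255).astype(np.uint8)
```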
Step (2.2): the velocity information of each joint is extracted to represent the dynamic characteristics of the motion, and a feature descriptor of the joint motion characteristics is constructed from the velocity scalar information. The velocity components of a joint along x, y and z in frame $f$ can be expressed as
$v_x = \dfrac{x^{f+\Delta f} - x^f}{\Delta t},\quad v_y = \dfrac{y^{f+\Delta f} - y^f}{\Delta t},\quad v_z = \dfrac{z^{f+\Delta f} - z^f}{\Delta t}$
where $(x^{f+\Delta f}, y^{f+\Delta f}, z^{f+\Delta f})$ is the three-dimensional coordinate of the joint in frame $f + \Delta f$, $\Delta f$ is the time step, and
$\Delta t = \dfrac{\Delta f}{\mathrm{FPS}}$
where FPS is the frame rate of the Kinect camera.
Mapping $v_x$, $v_y$ and $v_z$ to the R, G and B channels respectively, the joint motion information can be encoded as an N × (F − Δf) joint motion velocity map.
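A minimal sketch of the joint motion velocity map construction, assuming the same (F, N, 3) layout as above; the default time step, frame rate and channel normalisation are illustrative choices.

```python
import numpy as np

def joint_motion_velocity_map(skeleton, delta_f=1, fps=30):
    """Encode per-joint velocities into an (N, F - delta_f, 3) color map.

    skeleton: (F, N, 3) body-frame joint coordinates.
    delta_f:  time step in frames; fps: camera frame rate.
    """
    delta_t = delta_f / fps
    vel = (skeleton[delta_f:] - skeleton[:-delta_f]) / delta_t  # (F-Δf, N, 3)
    vel = np.transpose(vel, (1, 0, 2))                          # (N, F-Δf, 3)
    # min-max normalise so v_x, v_y, v_z become R, G, B intensities
    mins = vel.min(axis=(0, 1), keepdims=True)
    maxs = vel.max(axis=(0, 1), keepdims=True)
    return ((vel - mins) / (maxs - mins + 1e-8) * 255).astype(np.uint8)
```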
(3) Enhance the skeleton spatio-temporal feature map and the joint motion velocity map based on motion saliency and morphological operators, respectively, so as to increase the inter-class differences between different actions and reduce the intra-class differences within the same action.
(3.1) The spatial-domain information of joints with salient motion characteristics in the skeleton spatio-temporal feature map is enhanced based on motion energy, as follows:
During the k-th action sequence, for the joint $i$ with coordinates $\hat{p}_i^f$, the instantaneous energy in frame $f$ is
$e_i^f = \lVert \hat{p}_i^f - \hat{p}_i^{f-1} \rVert$
where $f > 1$ and $\lVert \cdot \rVert$ denotes the Euclidean distance. The motion energy of joint $i$ over the whole action sequence is therefore
$E_i^k = \sum_{f=2}^{F} e_i^f$
Based on the motion energy $E_i^k$, the color weight $\omega_i^k$ of the $i$-th joint can be obtained by
$\omega_i^k = \dfrac{E_i^k - E_{\min}^k}{E_{\max}^k - E_{\min}^k}$
where $E_{\max}^k$ and $E_{\min}^k$ are the maximum and minimum motion energies over all joints during the k-th action sequence.
Following the encoding order, the color weights of all joints in the k-th action are encoded as the motion-enhancement weight
$\Omega^k = [\omega_1^k, \omega_2^k, \dots, \omega_N^k]$
and the enhanced skeleton spatio-temporal feature map can be represented as
$\hat{E}_k = \Omega^k \odot E_k$
As shown in panel b1 of FIG. 6, the colors corresponding to the joint information with high motion energy are enhanced while the color information of joints with low motion energy is attenuated; this adaptive enhancement gives the skeleton spatio-temporal feature map motion-saliency characteristics and improves the action classification capability.
(3.2) The texture information of the joint motion velocity map is enhanced with morphological operators to improve the quality of the velocity estimate. An erosion operation is first applied to the joint motion velocity map to remove noise:
$X \ominus E = \{ z \mid (E)_z \subseteq X \}$
where $X$ is a binary image, $\ominus$ denotes the erosion operation and $E$ is the structuring element.
The erosion operation is applied to the velocity components $v_x$, $v_y$ and $v_z$ obtained in step (2.2):
$I_v = [v_x \ominus E,\ v_y \ominus E,\ v_z \ominus E]$  (30)
where $I_v$ denotes the joint motion velocity map after erosion.
A dilation operation is then applied to the eroded image to restore and smooth the original texture, which effectively reduces the intra-class velocity differences:
$J_v = I_v \oplus E = [(v_x \ominus E) \oplus E,\ (v_y \ominus E) \oplus E,\ (v_z \ominus E) \oplus E]$  (31)
where $J_v$ denotes the joint motion velocity map after erosion and dilation and $\oplus$ denotes the dilation operation.
As shown in panel b2 of FIG. 6, compared with the original image (first row), the texture of the enhanced image (second row) is smoother; with the original texture essentially preserved, useless information is effectively removed, which reduces the differences between similar actions.
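An illustrative sketch of the erosion-then-dilation enhancement applied channel-wise to the velocity map; scipy's grey-scale morphology is used here as an assumed stand-in for the structuring-element operations of the embodiment, and the structuring-element size is a parameter of the example.

```python
import numpy as np
from scipy import ndimage

def morphological_enhance(velocity_map, size=3):
    """Erode then dilate each R, G, B channel of the joint motion velocity map.

    velocity_map: (N, F, 3) uint8 joint motion velocity map.
    size:         edge length of the square structuring element.
    """
    out = np.empty_like(velocity_map)
    for c in range(3):  # channels encoding v_x, v_y, v_z
        eroded = ndimage.grey_erosion(velocity_map[..., c], size=(size, size))
        out[..., c] = ndimage.grey_dilation(eroded, size=(size, size))
    return out
```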
(4) Classify actions by deeply fusing the enhanced skeleton spatio-temporal feature map and joint motion velocity map with a two-stream convolutional neural network.
The two-stream convolutional neural network model consists of two modified AlexNets, as shown in FIG. 7: the numbers of neurons in the first, third and fourth convolutional layers of AlexNet are changed from 96, 384 and 384 to 64, 256 and 256, respectively, forming the two-stream convolutional neural network model of the invention.
The skeleton spatio-temporal feature map and the joint motion velocity map are used as the inputs of the static and dynamic streams, respectively, and after processing by the convolutional, pooling and fully connected layers, the posterior probabilities produced by the single-stream CNNs are fused into the final recognition result.
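One possible PyTorch rendering of a single stream with the stated channel widths (64, 256 and 256 in the first, third and fourth convolutional layers); the kernel sizes, strides, pooling layout and classifier head follow the standard AlexNet design and are assumptions of this sketch, not specified by the embodiment.

```python
import torch.nn as nn

def make_stream(num_classes):
    """One CNN stream of the two-stream model (modified AlexNet widths), 227x227 input."""
    return nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=11, stride=4), nn.ReLU(inplace=True),    # 96 -> 64 filters
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Conv2d(64, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True), # 384 -> 256 filters
        nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True), # 384 -> 256 filters
        nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Flatten(),
        nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
        nn.Linear(4096, 4096), nn.ReLU(inplace=True),
        nn.Linear(4096, num_classes),
    )
```

The two streams are two such networks with identical structure, one fed with the skeleton spatio-temporal feature map and the other with the joint motion velocity map.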
A given skeleton sequence $S_m$ is processed as above to obtain its skeleton spatio-temporal feature map and joint motion velocity map, and both are scaled to 227 × 227 pixels by bilinear interpolation to facilitate the subsequent deep-feature extraction. The deep features extracted by each CNN are passed to the last fully connected layer and then normalized by a Softmax function, yielding the posterior probabilities
$p_n^m(x) = \dfrac{\exp(a_n)}{\sum_{i=1}^{N} \exp(a_i)}$
where $p_n^m(x)$ is the probability that the image $x$ of the m-th skeleton sequence belongs to the n-th action class, $a_n$ is the input of the n-th neuron of the last fully connected layer, $x$ denotes a skeleton spatio-temporal feature map or a joint motion velocity map, and $N$ is the number of action classes.
For each input, the proposed model outputs the class-probability vectors $P_{SSTM}$ and $P_{JMSM}$, and multiplicative fusion is applied to the two stream outputs to obtain the final classification result:
$ActionClass = Fin(Max(P_{SSTM} \odot P_{JMSM}))$  (34)
where $Fin(\cdot)$ is the maximum-label function, $Max(\cdot)$ is the maximum operator, $\odot$ is the Hadamard product operator, SSTM denotes the skeleton spatio-temporal feature map, JMSM denotes the joint motion velocity map, $P_{SSTM}$ is the softmax output of the static stream and $P_{JMSM}$ is the softmax output of the dynamic stream, given respectively by
$P_{SSTM} = [p_1^{SSTM}, p_2^{SSTM}, \dots, p_N^{SSTM}],\quad P_{JMSM} = [p_1^{JMSM}, p_2^{JMSM}, \dots, p_N^{JMSM}]$
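A sketch of the multiplicative late fusion of the two softmax outputs (Hadamard product followed by arg-max), assuming the two streams have already produced their last-layer outputs; function names are illustrative.

```python
import numpy as np

def softmax(a):
    """Posterior probabilities from the outputs of the last fully connected layer."""
    e = np.exp(a - a.max())
    return e / e.sum()

def fuse_predictions(logits_static, logits_dynamic):
    """Hadamard-product fusion of the static (SSTM) and dynamic (JMSM) streams.

    Returns the index of the predicted action class.
    """
    p_sstm = softmax(logits_static)   # static stream: skeleton spatio-temporal feature map
    p_jmsm = softmax(logits_dynamic)  # dynamic stream: joint motion velocity map
    return int(np.argmax(p_sstm * p_jmsm))
```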
the invention relates to a double-current convolution neural network action identification method based on skeleton space-time and dynamic information, which comprises the steps of firstly transforming a skeleton three-dimensional coordinate system to obtain coordinate information containing relative positions of joints; secondly, coding the related information among joints into a color texture map to construct a skeleton space-time feature descriptor, and considering the physical structure constraint of a human body to increase the difference among classes; then, estimating the velocity information of each joint, and coding the velocity information into a color texture map to obtain a skeleton motion characteristic descriptor; in addition, the obtained space-time and dynamic characteristics are respectively enhanced based on the motion significance and the morphological operator so as to further improve the characteristic expression capability; and finally, the enhanced bone space-time and dynamic characteristics are deeply fused through a double-flow convolutional neural network to realize action recognition. Aiming at complex scenes with visual angle change, rich noise, subtle difference action and the like, the method can effectively improve the action recognition accuracy.
The above description presents only a preferred embodiment of the invention, but the scope of protection of the invention is not limited thereto; any person skilled in the art may substitute or modify the technical solution and the inventive concept within the technical scope disclosed by the invention, and such substitutions and modifications fall within the scope of the invention.

Claims (6)

1. A two-stream convolutional neural network action recognition method based on skeleton spatio-temporal and dynamic information, characterized in that it comprises the following steps:
(1) inputting a skeleton sequence and converting the coordinate system of the obtained skeleton sequence;
(2) constructing a skeleton spatio-temporal feature map and a joint motion velocity map from the converted coordinate information, specifically comprising:
(2.1) encoding the relative and absolute joint coordinates into a skeleton spatio-temporal feature map under human body structure constraints;
(2.2) encoding the joint velocity information at the same time step into a joint motion velocity map;
(3) enhancing the skeleton spatio-temporal feature map and the joint motion velocity map based on motion saliency and morphological operators, respectively;
(4) classifying actions by deeply fusing the enhanced skeleton spatio-temporal feature map and joint motion velocity map with a two-stream convolutional neural network.
2. The two-stream convolutional neural network action recognition method based on skeleton spatio-temporal and dynamic information according to claim 1, characterized in that step (1) is specifically:
the skeleton sequences captured by the depth sensor are located in a Cartesian coordinate system whose origin is the camera, and the three-dimensional skeleton coordinates are converted into a body coordinate system that effectively represents spatial-domain information, as follows:
a body coordinate system is constructed with the hip joint, whose motion amplitude is small, as the origin; for a video sequence with N joint points and F frames, the joint coordinates are converted as
$\hat{p}_j^f = p_j^f - p_1^f$  (1)
where $p_j^f$ and $\hat{p}_j^f$ are the coordinates of joint $j$ in frame $f$ before and after the coordinate transformation, respectively, and $p_1^f$ is the coordinate of the hip joint (joint 1) in frame $f$.
3. The two-stream convolutional neural network action recognition method based on skeleton spatio-temporal and dynamic information according to claim 1, characterized in that step (2) is specifically:
in step (2.1), the absolute joint coordinates and the relative coordinates between joints are jointly encoded into a color texture map to form a skeleton spatio-temporal feature map that represents the spatio-temporal characteristics of the action, as follows:
based on the coordinate-transformed skeleton sequence $\hat{S} = \{\hat{p}_j^f \mid j = 1, \dots, N;\ f = 1, \dots, F\}$, the relative joint positions are obtained by
$p_{j\_i}^f = \hat{p}_j^f - \hat{p}_i^f$  (2)
where $p_{j\_i}^f$ is the three-dimensional coordinate of joint $j$ relative to joint $i$ in frame $f$ and represents the spatial information of the bone connecting joints $j$ and $i$; when $i = 1$, $p_{j\_1}^f$ is the absolute coordinate of joint $j$, i.e. $p_{j\_1}^f = \hat{p}_j^f$;
the spatio-temporal feature of joint $j$ is represented by the matrix $Q_{j\_i}$:
$Q_{j\_i} = [p_{j\_i}^1, p_{j\_i}^2, \dots, p_{j\_i}^F]$  (3)
only the first- and second-level relative information, which has the highest correlation, is selected; the two levels are respectively given by
$R_1 = [Q_{h\_k}, Q_{j\_i}, \dots, Q_{m\_n}],\quad R_2 = [Q_{p\_o}, Q_{u\_v}, \dots, Q_{y\_x}]$  (4)
where $(h, k)$, $(j, i)$ and $(m, n)$ denote joint pairs connected by exactly one edge, and $(p, o)$, $(u, v)$ and $(y, x)$ denote joint pairs connected by two edges;
the coordinate information is arranged according to the body structure: all joints of the body are divided into five groups (left arm, right arm, left leg, right leg, torso), and each group is ordered by the physical connections between joints; the skeleton spatio-temporal feature obtained with this encoding order is
$E_k = [A, R_1, R_2]$
where $k$ is the action category and $A = [Q_{1\_1}, Q_{2\_1}, \dots, Q_{N\_1}]$ collects the absolute joint coordinates; letting the three coordinate components correspond to the R, G and B channels respectively, the skeleton spatio-temporal feature $E_k$ is converted into a 72 × F skeleton spatio-temporal feature map;
in step (2.2), the velocity information of each joint is extracted to represent the dynamic characteristics of the motion, and a feature descriptor of the joint motion characteristics is constructed from the velocity scalar information; the velocity components of a joint along x, y and z in frame $f$ are
$v_x = \dfrac{x^{f+\Delta f} - x^f}{\Delta t},\quad v_y = \dfrac{y^{f+\Delta f} - y^f}{\Delta t},\quad v_z = \dfrac{z^{f+\Delta f} - z^f}{\Delta t}$
where $(x^{f+\Delta f}, y^{f+\Delta f}, z^{f+\Delta f})$ is the three-dimensional coordinate of the joint in frame $f + \Delta f$, $\Delta f$ is the time step, and
$\Delta t = \dfrac{\Delta f}{\mathrm{FPS}}$
where FPS is the frame rate of the camera;
mapping $v_x$, $v_y$ and $v_z$ to the R, G and B channels respectively, the encoded joint motion information forms an N × (F − Δf) joint motion velocity map.
4. The two-stream convolutional neural network action recognition method based on skeleton spatio-temporal and dynamic information according to claim 3, characterized in that step (3) is specifically:
(3.1) the spatial-domain information of joints with salient motion characteristics in the skeleton spatio-temporal feature map is enhanced based on motion energy, as follows:
during a class-k action sequence, for the joint $i$ with coordinates $\hat{p}_i^f$, the instantaneous energy in frame $f$ is
$e_i^f = \lVert \hat{p}_i^f - \hat{p}_i^{f-1} \rVert$
where $f > 1$ and $\lVert \cdot \rVert$ denotes the Euclidean distance; the motion energy of joint $i$ over the whole action sequence is
$E_i^k = \sum_{f=2}^{F} e_i^f$
based on the motion energy $E_i^k$, the color weight $\omega_i^k$ of the $i$-th joint can be obtained by
$\omega_i^k = \dfrac{E_i^k - E_{\min}^k}{E_{\max}^k - E_{\min}^k}$
where $E_{\max}^k$ and $E_{\min}^k$ are the maximum and minimum motion energies over all joints during the k-th action sequence;
following the encoding order, the color weights of all joints in the k-th action are encoded as the motion-enhancement weight
$\Omega^k = [\omega_1^k, \omega_2^k, \dots, \omega_N^k]$
and the enhanced skeleton spatio-temporal feature map is represented as
$\hat{E}_k = \Omega^k \odot E_k$
(3.2) the texture information of the joint motion velocity map is enhanced with morphological operators to improve the quality of the velocity estimate, as follows:
first, an erosion operation is applied to the joint motion velocity map to remove noise:
$X \ominus E = \{ z \mid (E)_z \subseteq X \}$  (12)
where $X$ is a binary image, $\ominus$ denotes the erosion operation and $E$ is the structuring element; formula (12) is applied to the velocity components $v_x$, $v_y$ and $v_z$ of the joints in frame $f$ obtained in step (2.2):
$I_v = [v_x \ominus E,\ v_y \ominus E,\ v_z \ominus E]$  (13)
where $I_v$ denotes the joint motion velocity map after erosion;
a dilation operation is then applied to the eroded image:
$J_v = I_v \oplus E = [(v_x \ominus E) \oplus E,\ (v_y \ominus E) \oplus E,\ (v_z \ominus E) \oplus E]$  (14)
where $J_v$ denotes the joint motion velocity map after erosion and dilation, $\ominus$ denotes the erosion operation and $\oplus$ denotes the dilation operation.
5. The two-stream convolutional neural network action recognition method based on skeleton spatio-temporal and dynamic information according to claim 1, characterized in that step (4) is specifically:
the two-stream convolutional neural network model is based on the AlexNet model, with the numbers of neurons in its first, third and fourth convolutional layers set to 64, 256 and 256, respectively; the skeleton spatio-temporal feature map and the joint motion velocity map are used as the inputs of the static and dynamic streams, respectively, and after processing by the convolutional, pooling and fully connected layers, the posterior probabilities produced by the single-stream CNNs are fused into the final recognition result.
6. The two-stream convolutional neural network action recognition method based on skeleton spatio-temporal and dynamic information according to claim 4, characterized in that the skeleton spatio-temporal feature map and the joint motion velocity map are used as the inputs of the static and dynamic streams, respectively, and after processing by the convolutional, pooling and fully connected layers, the posterior probabilities produced by the single-stream CNNs are fused into the final recognition result, as follows:
a given skeleton sequence $S_m$ is processed to obtain its skeleton spatio-temporal feature map and joint motion velocity map, and both are scaled to 227 × 227 pixels by bilinear interpolation to facilitate the subsequent deep-feature extraction; the deep features extracted by each CNN are passed to the last fully connected layer and then normalized by a Softmax function to obtain the posterior probabilities
$p_n^m(x) = \dfrac{\exp(a_n)}{\sum_{i=1}^{N} \exp(a_i)}$  (15)
where $p_n^m(x)$ is the probability that the image $x$ of the m-th skeleton sequence belongs to the n-th action class, $a_n$ is the input of the n-th neuron of the last fully connected layer, $x$ denotes a skeleton spatio-temporal feature map or a joint motion velocity map, and $N$ is the number of action classes;
for each input, the two-stream convolutional neural network model outputs the class-probability vectors $P_{SSTM}$ and $P_{JMSM}$, and multiplicative fusion is applied to the two stream outputs to obtain the final classification result:
$ActionClass = Fin(Max(P_{SSTM} \odot P_{JMSM}))$  (16)
where $Fin(\cdot)$ is the maximum-label function, $Max(\cdot)$ is the maximum operator, $\odot$ is the Hadamard product operator, SSTM denotes the skeleton spatio-temporal feature map, JMSM denotes the joint motion velocity map, $P_{SSTM}$ is the softmax output of the static stream and $P_{JMSM}$ is the softmax output of the dynamic stream, given respectively by
$P_{SSTM} = [p_1^{SSTM}, p_2^{SSTM}, \dots, p_N^{SSTM}],\quad P_{JMSM} = [p_1^{JMSM}, p_2^{JMSM}, \dots, p_N^{JMSM}]$
CN202010539760.5A 2020-06-15 2020-06-15 Double-flow convolutional neural network action recognition method based on skeleton space-time and dynamic information Active CN111695523B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010539760.5A CN111695523B (en) 2020-06-15 2020-06-15 Double-flow convolutional neural network action recognition method based on skeleton space-time and dynamic information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010539760.5A CN111695523B (en) 2020-06-15 2020-06-15 Double-flow convolutional neural network action recognition method based on skeleton space-time and dynamic information

Publications (2)

Publication Number Publication Date
CN111695523A true CN111695523A (en) 2020-09-22
CN111695523B CN111695523B (en) 2023-09-26

Family

ID=72480940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010539760.5A Active CN111695523B (en) 2020-06-15 2020-06-15 Double-flow convolutional neural network action recognition method based on skeleton space-time and dynamic information

Country Status (1)

Country Link
CN (1) CN111695523B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270246A (en) * 2020-10-23 2021-01-26 泰康保险集团股份有限公司 Video behavior identification method and device, storage medium and electronic equipment
CN112861808A (en) * 2021-03-19 2021-05-28 泰康保险集团股份有限公司 Dynamic gesture recognition method and device, computer equipment and readable storage medium
CN113011381A (en) * 2021-04-09 2021-06-22 中国科学技术大学 Double-person motion identification method based on skeleton joint data
CN114943987A (en) * 2022-06-07 2022-08-26 首都体育学院 Motion behavior knowledge graph construction method adopting PAMS motion coding
US11854305B2 (en) 2021-05-09 2023-12-26 International Business Machines Corporation Skeleton-based action recognition using bi-directional spatial-temporal transformer

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104038738A (en) * 2014-06-04 2014-09-10 东北大学 Intelligent monitoring system and intelligent monitoring method for extracting coordinates of human body joint
CN105787439A (en) * 2016-02-04 2016-07-20 广州新节奏智能科技有限公司 Depth image human body joint positioning method based on convolution nerve network
US20170347107A1 (en) * 2016-05-26 2017-11-30 Mstar Semiconductor, Inc. Bit allocation method and video encoding device
CN109460707A (en) * 2018-10-08 2019-03-12 华南理工大学 A kind of multi-modal action identification method based on deep neural network
CN109670401A (en) * 2018-11-15 2019-04-23 天津大学 A kind of action identification method based on skeleton motion figure
CN109919122A (en) * 2019-03-18 2019-06-21 中国石油大学(华东) A kind of timing behavioral value method based on 3D human body key point
CN110059662A (en) * 2019-04-26 2019-07-26 山东大学 A kind of deep video Activity recognition method and system
CN110188599A (en) * 2019-04-12 2019-08-30 哈工大机器人义乌人工智能研究院 A kind of human body attitude behavior intellectual analysis recognition methods
CN110222568A (en) * 2019-05-05 2019-09-10 暨南大学 A kind of across visual angle gait recognition method based on space-time diagram
CN110253583A (en) * 2019-07-02 2019-09-20 北京科技大学 The human body attitude robot teaching method and device of video is taken based on wearing teaching
CN110929637A (en) * 2019-11-20 2020-03-27 中国科学院上海微系统与信息技术研究所 Image identification method and device, electronic equipment and storage medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104038738A (en) * 2014-06-04 2014-09-10 东北大学 Intelligent monitoring system and intelligent monitoring method for extracting coordinates of human body joint
CN105787439A (en) * 2016-02-04 2016-07-20 广州新节奏智能科技有限公司 Depth image human body joint positioning method based on convolution nerve network
US20170347107A1 (en) * 2016-05-26 2017-11-30 Mstar Semiconductor, Inc. Bit allocation method and video encoding device
CN109460707A (en) * 2018-10-08 2019-03-12 华南理工大学 A kind of multi-modal action identification method based on deep neural network
CN109670401A (en) * 2018-11-15 2019-04-23 天津大学 A kind of action identification method based on skeleton motion figure
CN109919122A (en) * 2019-03-18 2019-06-21 中国石油大学(华东) A kind of timing behavioral value method based on 3D human body key point
CN110188599A (en) * 2019-04-12 2019-08-30 哈工大机器人义乌人工智能研究院 A kind of human body attitude behavior intellectual analysis recognition methods
CN110059662A (en) * 2019-04-26 2019-07-26 山东大学 A kind of deep video Activity recognition method and system
CN110222568A (en) * 2019-05-05 2019-09-10 暨南大学 A kind of across visual angle gait recognition method based on space-time diagram
CN110253583A (en) * 2019-07-02 2019-09-20 北京科技大学 The human body attitude robot teaching method and device of video is taken based on wearing teaching
CN110929637A (en) * 2019-11-20 2020-03-27 中国科学院上海微系统与信息技术研究所 Image identification method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
吴珍珍; 邓辉舫: "3D Human Action Recognition Using Skeleton Model and Grassmann Manifold", Computer Engineering and Applications *
郑潇; 彭晓东; 王嘉璇: "Human Behavior Recognition Method Based on Pose Spatio-Temporal Features" *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270246A (en) * 2020-10-23 2021-01-26 泰康保险集团股份有限公司 Video behavior identification method and device, storage medium and electronic equipment
CN112270246B (en) * 2020-10-23 2024-01-05 泰康保险集团股份有限公司 Video behavior recognition method and device, storage medium and electronic equipment
CN112861808A (en) * 2021-03-19 2021-05-28 泰康保险集团股份有限公司 Dynamic gesture recognition method and device, computer equipment and readable storage medium
CN112861808B (en) * 2021-03-19 2024-01-23 泰康保险集团股份有限公司 Dynamic gesture recognition method, device, computer equipment and readable storage medium
CN113011381A (en) * 2021-04-09 2021-06-22 中国科学技术大学 Double-person motion identification method based on skeleton joint data
US11854305B2 (en) 2021-05-09 2023-12-26 International Business Machines Corporation Skeleton-based action recognition using bi-directional spatial-temporal transformer
CN114943987A (en) * 2022-06-07 2022-08-26 首都体育学院 Motion behavior knowledge graph construction method adopting PAMS motion coding

Also Published As

Publication number Publication date
CN111695523B (en) 2023-09-26

Similar Documents

Publication Publication Date Title
CN111339903B (en) Multi-person human body posture estimation method
CN111695523B (en) Double-flow convolutional neural network action recognition method based on skeleton space-time and dynamic information
CN113128424B (en) Method for identifying action of graph convolution neural network based on attention mechanism
CN110728183A (en) Human body action recognition method based on attention mechanism neural network
CN114283495B (en) Human body posture estimation method based on binarization neural network
CN112329525A (en) Gesture recognition method and device based on space-time diagram convolutional neural network
CN113792641A (en) High-resolution lightweight human body posture estimation method combined with multispectral attention mechanism
CN115830652B (en) Deep palm print recognition device and method
CN112036260A (en) Expression recognition method and system for multi-scale sub-block aggregation in natural environment
Yuan et al. STransUNet: A siamese TransUNet-based remote sensing image change detection network
CN116258757A (en) Monocular image depth estimation method based on multi-scale cross attention
CN115063717A (en) Video target detection and tracking method based on key area live-action modeling
CN111462274A (en) Human body image synthesis method and system based on SMP L model
CN116704596A (en) Human behavior recognition method based on skeleton sequence
CN114882524A (en) Monocular three-dimensional gesture estimation method based on full convolution neural network
CN114333002A (en) Micro-expression recognition method based on deep learning of image and three-dimensional reconstruction of human face
Hang et al. Spatial-temporal adaptive graph convolutional network for skeleton-based action recognition
Zhao et al. Adaptive Dual-Stream Sparse Transformer Network for Salient Object Detection in Optical Remote Sensing Images
CN112149645A (en) Human body posture key point identification method based on generation of confrontation learning and graph neural network
CN115331301A (en) 6D attitude estimation method based on Transformer
CN113936333A (en) Action recognition algorithm based on human body skeleton sequence
CN117252892B (en) Automatic double-branch portrait matting device based on light visual self-attention network
CN117611428A (en) Fashion character image style conversion method
CN117115855A (en) Human body posture estimation method and system based on multi-scale transducer learning rich visual features
CN117315069A (en) Human body posture migration method based on image feature alignment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant