A multi-modal action recognition method based on deep neural networks
Technical field
The present invention relates to the technical field of image processing, and in particular to a multi-modal action recognition method based on deep neural networks.
Background art
Action recognition has recently become a very popular research direction. By identifying the movements of the human body in video, it can serve as a new kind of interactive input for processing equipment, and it can be widely applied in everyday fields such as gaming and film. The action recognition task involves identifying different actions from video clips, possibly spanning an entire video. It is a natural extension of the image classification task: image recognition is performed on the individual frames of a video, and the per-frame results are then aggregated into a final action prediction.
Traditional video action recognition techniques tend to rely on hand-designed feature extractors to extract the spatio-temporal features of an action. With the rise of deep learning, such handcrafted feature extractors have been replaced by deep convolutional neural networks. Although deep learning frameworks have achieved great success in image classification (ImageNet), progress on architectures for video classification and representation learning has been slow. The main obstacle is the enormous computational cost: a simple 101-class two-dimensional convolutional neural network has only about 5M parameters, whereas expanding the same architecture into a three-dimensional structure increases this to about 33M parameters. Training a three-dimensional convolutional neural network (3D ConvNet) on UCF101 takes 3 to 4 days, and about 2 months on Sports-1M, which makes architecture exploration difficult and prone to overfitting.
Action recognition also involves capturing the spatio-temporal context across frames. In addition, the captured spatial information needs the assistance of hardware and is generally compensated for camera movement, and even a very strong spatial object detection capability cannot satisfy the demands of action recognition, because the finer details carried by the motion information have not yet been mined. To obtain better predictions, the motion information of the local context in the video must be captured, together with the motion information of the global context.
Today's video action recognition techniques rely entirely on deep learning, among which a classic work is the two-stream convolutional neural network. The two-stream convolutional network borrows from the two visual pathways that process information in the brain: the ventral pathway ("what" pathway) handles spatial information such as the shape and color of objects, while the dorsal pathway ("where" pathway) handles information related to motion and position. Although this method improves on single-stream methods by explicitly capturing local temporal motion, the video-level prediction is obtained by averaging the prediction scores of sampled clips, so medium- and long-term temporal information is still lost from the learned features. There is therefore still much room to improve two-stream video recognition methods.
Summary of the invention
The purpose of the present invention is to solve the above drawbacks in the prior art by providing a multi-modal action recognition method based on deep neural networks. The method adds the human skeleton as an additional modality on top of the two-stream convolutional network. Human pose estimation is comparatively tractable (the key points of the human skeleton are strongly correlated, so bottom-up and top-down cues can be combined for localization), and mature open frameworks such as AlphaPose are available. Introducing the skeleton into action recognition eliminates, on the one hand, the interference of irrelevant background; on the other hand, it finely depicts how the position of each key point of the moving human body changes across the frame sequence, which benefits recognition of the action. The method uses a deep neural network with a multi-branch structure to perform multi-modal action recognition, in which the image branch handles spatial information such as the shape and color of objects, the optical flow branch handles information related to motion and position, and the skeleton branch achieves a fine depiction of the action by processing the path integral features of the frame sequence. In addition, the invention introduces a pooling method based on an attention mechanism in the image branch, so that the image branch can automatically place its focus on the regions of interest most closely tied to the action category, further increasing the accuracy of the action recognition method.
The purpose of the present invention can be achieved by adopting the following technical scheme:
A multi-modal action recognition method based on deep neural networks, the action recognition method comprising the following steps:
S1. Collect public databases and convert every frame of the video data into an RGB picture set. The naming rule uses video name + time + action id as the filename for separating the data. The data are split into a training set and a test set in a ratio of 3:1, where the action id covers the following six basic actions: walking, running, waving, bending over, jumping, standing.
S2. Unify the resolution of the data set obtained in step S1.
S3. Compress the image data set processed in step S2 to reduce the amount of computation, i.e., compress the pixel information of every video frame using the image discrete cosine transform.
S4. For the video data processed in step S3, along the time dimension, delete video frames whose time interval falls within the interval threshold or whose picture similarity exceeds the similarity threshold.
S5. Extract the optical flow information of N consecutive video frames from the data processed in step S4, where N is a positive integer greater than or equal to 10.
S6. Using an open-source pose estimation algorithm such as AlphaPose, extract the human skeleton from the video frame by frame to obtain a frame sequence, and compute the path integral features of that frame sequence.
S7. Take the optical flow information extracted in step S5, the human skeleton path integral features extracted in step S6, and the several video images processed in step S4 as the input of the deep neural network. At the low level, the deep neural network has three branches: a convolutional neural network for extracting temporal features, a convolutional neural network for extracting spatial features, and a fully-connected network for processing the skeleton path integral features. At the high level, the three low-level branches are merged into one branch through feature fusion, and the category id of the video action is predicted by a softmax activation function.
Further, the databases collected in step S1 mainly include the KTH human behavior database and the UCF Sports database.
Further, step S2 unifies the video image resolution to 120*90.
Further, step S3 performs a discrete cosine transform on every frame of the video data, thresholds the transformed DCT coefficients by zeroing out the coefficients smaller than a certain threshold so that the compression ratio is 10:1, and then performs an inverse DCT operation to obtain the single-frame images of the compressed video data.
Further, step S4 deletes, along the time dimension, video frames spaced within 500ms whose similarity exceeds 70%, reducing redundancy. The interval threshold for the time interval ranges from 400ms to 1000ms, with a typical value of 500ms; the similarity threshold for the picture similarity ranges from 0.5 to 0.9, with a typical value of 0.7.
Further, step S5 extracts the optical flow information of 10 consecutive video frames from the processed data, mainly by using the Lucas-Kanade algorithm, which applies the least-squares principle over all pixels in a neighborhood to solve the basic optical flow equations, finally yielding the required optical flow information.
Further, step S6 extracts the human skeleton from the video frame by frame using an open-source pose estimation algorithm such as AlphaPose to obtain a frame sequence, and computes the path integral features of that frame sequence.
Further, step S7 inputs the optical flow information extracted in step S5, the human skeleton path integral features extracted in step S6, and the several video images processed in step S4 into the deep neural network, whose network structure is as follows:
At the low level, the deep neural network has three branches, namely the image branch, the optical flow branch, and the skeleton branch, corresponding respectively to the inputs of the three modalities; at the high level, the three low-level branches are merged into one branch through feature fusion, wherein:
The image branch uses a convolutional neural network, connected in sequence from input layer to output layer as: convolutional layer conv1, pooling layer pooling1, convolutional layer conv2, pooling layer pooling2, convolutional layer conv3, convolutional layer conv4, convolutional layer conv5, attention pooling layer attention pooling, fully-connected layer fc6, fully-connected layer fc7, fully-connected layer fc8, data fusion layer fusion, and loss function layer loss;
The optical flow branch uses a convolutional neural network, connected in sequence from input layer to output layer as: convolutional layer conv1, pooling layer pooling1, convolutional layer conv2, pooling layer pooling2, convolutional layer conv3, convolutional layer conv4, convolutional layer conv5, pooling layer pooling5, fully-connected layer fc6, fully-connected layer fc7, fully-connected layer fc8, data fusion layer fusion, and loss function layer loss;
The skeleton branch uses a fully-connected network, consisting in sequence from input layer to output layer of fully-connected layer fc1, fully-connected layer fc2, data fusion layer fusion, and loss function layer loss.
Further, the data fusion layer fusion classifies the input video action through a softmax activation function and optimizes the parameters of the network by minimizing a classification loss function.
Further, in step S7 the attention pooling layer of the image branch introduces an attention mechanism. The features after convolution are used to construct two groups of learnable weight vectors: the bottom-up saliency weight vector b and the top-down attention weight vector a. Matrix operations are then used to apply the bottom-up saliency weighting and the top-down attention weighting to the feature projections, and the responses of the two are finally fused to obtain the final result. Suppose the feature to be pooled is X with X ∈ R^(n×f) and a, b ∈ R^(f×1), where n is the spatial size of the feature map to be pooled and f is its number of channels. Xb ∈ R^(n×1) represents the projection map obtained after feature X has undergone the bottom-up saliency weighting; this projection map is independent of the specific class and has a thickness of 1. Xa represents the projection map obtained after feature X has undergone the top-down attention weighting. Since different classes should have different attention weight vectors a, let the number of categories be K; then the attention weight matrix over all categories is A ∈ R^(f×K), and the top-down attention projection maps are given by XA ∈ R^(n×K). Finally, the attention projection Xa_k of each particular category k is multiplied element-wise with the saliency projection Xb, and the products are summed over the n spatial positions, yielding the attention-weighted feature response of that category.
Further, the data fusion layer fusion classifies the input video action through a softmax activation function and optimizes the parameters of the network by minimizing a classification loss function. Training of the network model is not restricted to a specific training framework; the Caffe, MXNet, Torch, or TensorFlow frameworks, among others, may be used.
Compared with the prior art, the present invention has the following advantages and effects:
(1) The multi-modal action recognition method based on deep neural networks disclosed by the invention uses a deep neural network with a multi-branch structure, in which the image branch handles spatial information such as the shape and color of objects, the optical flow branch handles information related to motion and position, and the skeleton branch achieves a fine depiction of the action by processing the path integral features of the frame sequence.
(2) The multi-modal action recognition method based on deep neural networks disclosed by the invention first reduces the amount of network computation through preprocessing, greatly shortening the running time, and it comprehensively exploits multi-modal information such as the video images, optical flow maps, and the human skeleton, significantly improving the accuracy of video action recognition.
(3) The multi-modal action recognition method based on deep neural networks disclosed by the invention introduces an attention-weighted pooling operation in the pooling layer of the image branch. During training it learns a weight for each pooling unit on its own: pooling units with larger weights correspond to abstract features closely tied to the action, while pooling units with smaller weights correspond to other features that should be ignored or that would interfere with action recognition. After passing through the attention-based pooling structure, features unrelated to the action category are ignored, while features closely tied to the action are "amplified", improving the accuracy and precision of action recognition.
Brief description of the drawings
Fig. 1 is a schematic diagram of the model of the multi-modal action recognition method based on deep neural networks disclosed in the present invention;
Fig. 2 is a schematic diagram of the computation of the attention-based pooling structure proposed by the present invention;
Fig. 3 is a flow chart of the multi-modal action recognition method based on deep neural networks disclosed in the present invention.
Specific embodiments
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
Embodiment
As shown in Fig. 1, this embodiment discloses a multi-modal action recognition method based on deep neural networks. At the low level, the deep neural network used in this embodiment has three branches: a convolutional neural network for extracting temporal features, a convolutional neural network for extracting spatial features, and a fully-connected network for processing the skeleton path integral features. At the high level, the three branches are merged into one branch through feature fusion, and the category id of the video action is predicted by a softmax activation function. In the image branch, a pooling structure based on an attention mechanism is introduced; without changing the existing network infrastructure, it helps the network focus on the features that favor recognizing the action, thereby reducing the interference of irrelevant features, improving the performance of the existing network, and allowing the video human action recognition system to be applied to engineering more effectively.
As an embodiment of the present invention, complete training data improves the training precision of the model. In addition, preprocessing and compressing the data further reduces the interference of redundant and irrelevant information and reduces the computation of the model, thereby shortening the model training time and improving the training precision. Accordingly, as an embodiment of the present invention, the multi-modal action recognition method based on deep neural networks proceeds as follows:
S1. Collection of training data
Public databases are collected, mainly including the following: the KTH human behavior database and the UCF Sports database. Every frame of the video data is converted into an RGB picture set, with the naming rule video name + time + action id as the filename for separating the data. The data are split into a training set and a test set in a ratio of 3:1, where the action id covers six basic actions: walking, running, waving, bending over, jumping, standing.
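As a concrete illustration of this step, the following is a minimal sketch of the naming convention and the 3:1 split, assuming the frames have already been exported as PNG files named video_time_actionid.png (the directory layout, the separator, and the fixed random seed are illustrative assumptions, not part of the invention):

```python
import random
from pathlib import Path

ACTIONS = ["walking", "running", "waving", "bending_over", "jumping", "standing"]

def split_dataset(frame_dir, ratio=3):
    """Split frame files named '<video>_<time>_<action-id>.png' into
    training and test sets at a 3:1 ratio, as described in step S1."""
    files = sorted(Path(frame_dir).glob("*.png"))
    random.seed(0)                      # fixed seed for a reproducible split
    random.shuffle(files)
    cut = len(files) * ratio // (ratio + 1)
    return files[:cut], files[cut:]

train_set, test_set = split_dataset("data/frames")
```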
S2. The data set obtained in step S1 is normalized, i.e., its resolution is unified: the picture of every frame is compressed to a uniform resolution of 120*90, which, while preserving the information integrity of the image as far as possible, reduces the computation of the convolutional neural network model and improves the recognition speed.
S3. The image data set processed in step S2 is compressed to reduce the amount of computation: the pixel information of every video frame is compressed using the image discrete cosine transform (DCT) at a compression ratio of 10:1, reducing the amount of information handled during initialization. A discrete cosine transform is performed on the original image, the transformed DCT coefficients are thresholded by zeroing the coefficients smaller than the threshold, the image is compressed and quantized, and an inverse DCT operation then yields the final compressed image.
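A minimal sketch of this per-frame DCT compression, assuming OpenCV and NumPy and a single-channel input (cv2.dct requires float32 input with even side lengths, which the 120*90 frames from step S2 satisfy); keeping the largest 10% of coefficients by magnitude is one illustrative way to realize the stated 10:1 ratio:

```python
import cv2
import numpy as np

def dct_compress(gray, keep_ratio=0.1):
    """Zero the small DCT coefficients of one frame (approx. 10:1, step S3)."""
    coeffs = cv2.dct(gray.astype(np.float32))
    # Threshold: keep only the largest `keep_ratio` fraction of coefficients.
    thresh = np.quantile(np.abs(coeffs), 1.0 - keep_ratio)
    coeffs[np.abs(coeffs) < thresh] = 0.0
    # Inverse DCT recovers the compressed single-frame image.
    return np.clip(cv2.idct(coeffs), 0, 255).astype(np.uint8)
```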
S4. For the video data processed in step S3, along the time dimension, video frames spaced within 500ms or with a picture similarity above 0.7 are deleted, reducing redundancy.
The method for calculating picture similarity proceeds as follows:
S41. Scale the pictures: compress each picture to a common size of 8*8, i.e., 64 pixel values;
S42. Simplify the colors: convert the picture to a grayscale image;
S43. Compute the average: compute the average of the pixel values over all pixels of the grayscale image;
S44. Compare pixel gray values: traverse each pixel of the grayscale image and compare it with the average computed in the previous step, recording 1 if it is greater than the average and 0 otherwise;
S45. Obtain the 64-bit image fingerprint;
S46. Compute the Hamming distance between the image fingerprints of the two pictures and use it as the measure of picture similarity.
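A minimal sketch of this fingerprint comparison (steps S41-S46), assuming OpenCV and NumPy; converting the Hamming distance into a [0, 1] similarity score is an added assumption so that the 0.7 threshold of step S4 can be applied directly:

```python
import cv2
import numpy as np

def average_hash(img):
    """64-bit average-hash fingerprint of a BGR image (steps S41-S45)."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # S42: grayscale
    small = cv2.resize(gray, (8, 8))               # S41: 8*8, 64 pixels
    return (small > small.mean()).flatten()        # S43-S45: one bit per pixel

def picture_similarity(img_a, img_b):
    """Similarity in [0, 1] derived from the Hamming distance (S46)."""
    dist = np.count_nonzero(average_hash(img_a) != average_hash(img_b))
    return 1.0 - dist / 64.0
```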
S5. The optical flow method mainly exploits the variation of pixels in the time domain of an image sequence and the correlation between consecutive frames to find the correspondence between the previous frame and the current frame, thereby computing the motion information of objects between consecutive frames. The Lucas-Kanade method is a widely used differential method of optical flow estimation; using the least-squares principle, it solves the basic optical flow equations over all pixels in a neighborhood. Compared with common point-by-point methods, the Lucas-Kanade algorithm is less sensitive to image noise. The bidirectional optical flow information of 10 consecutive video frames is therefore extracted from the video frame data processed in step S4 using the Lucas-Kanade algorithm. The Lucas-Kanade algorithm is the method proposed in: Lucas B and Kanade T, "An Iterative Image Registration Technique with an Application to Stereo Vision", Proc. of the 7th International Joint Conference on Artificial Intelligence (IJCAI), pp. 674-679. It is implemented in OpenCV, so this implementation extracts the optical flow information using the Lucas-Kanade routine in OpenCV.
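As an illustration, the following is a minimal sketch of this step using OpenCV's pyramidal Lucas-Kanade routine cv2.calcOpticalFlowPyrLK; the corner-detection parameters are illustrative assumptions, and the "bidirectional" flow of step S5 is obtained by running the same routine with the frame order reversed as well:

```python
import cv2

def lk_flow(prev_gray, next_gray, max_corners=100):
    """Sparse Lucas-Kanade flow between two consecutive grayscale frames."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=5)
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    good = status.ravel() == 1          # keep only successfully tracked points
    return pts[good].reshape(-1, 2), nxt[good].reshape(-1, 2)

def clip_flow(frames):
    """Forward flow over a clip of 10 consecutive frames (step S5);
    reversing `frames` yields the backward half of the bidirectional flow."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    return [lk_flow(a, b) for a, b in zip(grays[:-1], grays[1:])]
```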
S6. Path integral features, obtained through iterated integrals of a path, can extract rich dynamic information from the path, such as the displacement and curvature that characterize it. Using a pose estimation algorithm such as AlphaPose, the human skeleton is extracted frame by frame from the video data processed in step S4, giving a skeleton time series. Denote the number of video frames by N and the number of key points by K (taken as 15). Each key point has two coordinates, so the frame sequence is a path of dimension d = 2K and length N. It can be written as P_d = {X_1, X_2, ..., X_N}, where each X_i is a 2K-dimensional vector. The discrete path P_d is a sampling of the underlying continuous key-point path P_t: [0, T] → R^d. For P_t, the k-th order iterated integral of the path is defined as follows:

I^(i1,...,ik) = ∫_{0 < t1 < t2 < ... < tk < T} dX^(i1)_t1 dX^(i2)_t2 ... dX^(ik)_tk

where each index i1, ..., ik ranges over the d coordinates of the path. The path integral feature is then the collection of the iterated integrals of all orders, an infinite-dimensional vector; the 0th-order iterated integral is defined as 1. In engineering practice, the iterated integrals of the first m orders generally portray the dynamic characteristics of the path well enough, so the path integral feature truncated at order m is taken as:

S(X)|_m = {1, I_1, I_2, ..., I_m}

In practice, P_t is not available, only P_d; the path integrals can then be computed by tensor algebra.
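A minimal sketch of computing the truncated path integral feature from a skeleton sequence. It assumes the third-party iisignature package, whose sig function returns the iterated-integral terms of orders 1 through m for a discrete path; prepending the constant 1 matches the definition of S(X)|_m above:

```python
import numpy as np
import iisignature  # pip install iisignature (assumed available)

def skeleton_signature(keypoints, m=2):
    """Truncated path integral feature S(X)|_m of a skeleton sequence.

    keypoints: array of shape (N, K, 2) -- N frames, K key points (K=15),
    two coordinates each -- flattened to an N-step path in R^(2K) as in
    step S6. For d = 30 and m = 2 this yields 1 + 30 + 900 = 931 values.
    """
    n_frames = keypoints.shape[0]
    path = keypoints.reshape(n_frames, -1).astype(np.float64)  # (N, 2K)
    sig = iisignature.sig(path, m)       # orders 1..m, concatenated
    return np.concatenate(([1.0], sig))  # prepend the 0th-order term
```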
With the data constructed by the downloading and preprocessing of the above steps and divided into a training set and a test set in a ratio of 3:1, the neural network model for action recognition is constructed as follows.
S7. The optical flow information extracted in step S5, the human skeleton path integral features extracted in step S6, and the several video images processed in step S4 are input into the deep neural network. At the low level, the network has three branches, namely the image branch, the optical flow branch, and the skeleton branch, corresponding respectively to the inputs of the three modalities; at the high level, the three branches are merged into one branch through feature fusion.
The image branch uses a convolutional neural network, connected in sequence from input layer to output layer as: convolutional layer conv1, pooling layer pooling1, convolutional layer conv2, pooling layer pooling2, convolutional layer conv3, convolutional layer conv4, convolutional layer conv5, attention pooling layer attention pooling, fully-connected layer fc6, fully-connected layer fc7, fully-connected layer fc8, data fusion layer fusion, and loss function layer loss;
The optical flow branch uses a convolutional neural network, connected in sequence from input layer to output layer as: convolutional layer conv1, pooling layer pooling1, convolutional layer conv2, pooling layer pooling2, convolutional layer conv3, convolutional layer conv4, convolutional layer conv5, pooling layer pooling5, fully-connected layer fc6, fully-connected layer fc7, fully-connected layer fc8, data fusion layer fusion, and loss function layer loss;
The skeleton branch uses a fully-connected network, consisting in sequence from input layer to output layer of fully-connected layer fc1, fully-connected layer fc2, data fusion layer fusion, and loss function layer loss.
The overall network structure is shown in Fig. 1. In the multi-branch neural network, the image branch captures the spatial dependencies in the video, the optical flow branch captures the recurring motion present at each spatial position in the video, and the skeleton branch finely depicts the spatio-temporal changes of the positions of the human key points during the action. The three branches each pass through their own feature learning network and are merged at the data fusion layer fusion, yielding the final abstract spatio-temporal features relevant to action recognition. The features of the fusion layer are passed through a softmax activation to predict the action category.
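To make the branch-and-fusion topology concrete, the following is a minimal PyTorch sketch; the patent itself is framework-agnostic (naming Caffe, MXNet, Torch, and TensorFlow), all layer widths here are illustrative assumptions, the small conv stacks stand in for the conv1-fc8 chains listed above, and the attention pooling of the image branch is sketched separately below:

```python
import torch
import torch.nn as nn

class ThreeBranchNet(nn.Module):
    """Illustrative image / optical flow / skeleton fusion network."""
    def __init__(self, sig_dim=931, num_classes=6):
        super().__init__()
        def conv_stack(in_ch):  # stands in for conv1..conv5 and fc6..fc8
            return nn.Sequential(
                nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(128, 256), nn.ReLU())
        self.image = conv_stack(3)      # RGB frames
        self.flow = conv_stack(20)      # x/y flow of 10 frames, stacked
        self.skeleton = nn.Sequential(  # fc1, fc2 on path integral features
            nn.Linear(sig_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU())
        self.fusion = nn.Linear(256 * 3, num_classes)

    def forward(self, img, flow, sig):
        fused = torch.cat([self.image(img), self.flow(flow),
                           self.skeleton(sig)], dim=1)  # feature fusion layer
        return self.fusion(fused)  # class logits

# Cross-entropy combines the softmax activation with the classification loss.
criterion = nn.CrossEntropyLoss()
```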
In the image branch, attention pooling is a pooling computation structure based on the attention mechanism, as shown in Fig. 2. The features after convolution are used to construct two groups of learnable weight vectors: the bottom-up saliency weight vector b and the top-down attention weight vector a. Matrix operations are then used to apply the bottom-up saliency weighting and the top-down attention weighting to the feature projections, and the responses of the two are finally fused to obtain the final result. Suppose the feature to be pooled is X with X ∈ R^(n×f) and a, b ∈ R^(f×1), where n is the spatial size of the feature map to be pooled and f is its number of channels. Xb ∈ R^(n×1) represents the projection map obtained after feature X has undergone the bottom-up saliency weighting; as shown in Fig. 2, this projection map is independent of the specific class and has a thickness of 1. Xa represents the projection map obtained after feature X has undergone the top-down attention weighting. Since different classes should have different attention weight vectors a, let the number of categories be K; then the attention weight matrix over all categories is A ∈ R^(f×K), and the top-down attention projection maps are given by XA ∈ R^(n×K). As shown in Fig. 2, the final attention projection Xa_k of each particular category k is first multiplied element-wise with the saliency projection Xb, and the products are then summed over the n spatial positions, yielding the attention-weighted feature response of that category.
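A minimal PyTorch sketch of this attention pooling computation, assuming the convolutional feature map has already been flattened to shape (batch, n, f); random parameter initialization is an illustrative assumption:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Per-class response: element-wise product of Xa_k and Xb, summed over n."""
    def __init__(self, f, k):
        super().__init__()
        self.b = nn.Parameter(torch.randn(f, 1))  # bottom-up saliency vector b
        self.A = nn.Parameter(torch.randn(f, k))  # top-down weights, one column per class

    def forward(self, x):
        # x: (batch, n, f) -- n spatial positions, f channels, as in the text.
        saliency = x @ self.b     # Xb: (batch, n, 1), class-agnostic
        attention = x @ self.A    # XA: (batch, n, K), one map per class
        # Fuse the two responses: multiply element-wise, sum over positions.
        return (attention * saliency).sum(dim=1)  # (batch, K)
```

The class-agnostic saliency map Xb broadcasts across all K attention maps, which realizes the fusion of the bottom-up and top-down responses described above.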
Through multi-modal fusion and the attention mechanism, the multi-modal action recognition method based on deep neural networks in the present invention captures spatio-temporal features relevant to action recognition, thereby improving the recognition accuracy of the network.
The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited by the above embodiment. Any other change, modification, substitution, combination, or simplification made without departing from the spirit and principles of the present invention shall be an equivalent replacement and is included within the protection scope of the present invention.