A multi-modal action recognition method based on deep neural networks
Technical field
The present invention relates to the technical field of image processing, and in particular to a multi-modal action recognition method based on deep neural networks.
Background art
Action recognition has recently become a very popular research direction. By identifying the movements of the human body in video, it can serve as a new kind of interactive input for processing equipment, and it can be widely applied in everyday fields such as gaming and film. The action recognition task involves identifying different actions from video clips, possibly spanning an entire video. It is a natural extension of the image classification task: image recognition is performed on the individual frames of a video, and the per-frame results are then aggregated into a final action prediction.
Traditional video action recognition techniques tend to rely on hand-designed feature extractors to extract the spatio-temporal features of an action. With the rise of deep learning, such handcrafted feature extractors have been replaced by deep convolutional neural networks. Although deep learning frameworks have achieved great success in image classification (ImageNet), progress on architectures for video classification and representation learning has been slow. The main obstacle is the enormous computational cost: a simple 101-class two-dimensional convolutional neural network has only about 5M parameters, whereas expanding the same architecture into a three-dimensional structure increases this to about 33M parameters. Training a three-dimensional convolutional neural network (3D ConvNet) on UCF101 takes 3 to 4 days, and about 2 months on Sports-1M, which makes architecture exploration difficult and prone to overfitting.
Action recognition also involves capturing the spatio-temporal context across frames. In addition, the captured spatial information needs the assistance of hardware and is generally compensated for camera movement, and even a very strong spatial object detection capability cannot satisfy the demands of action recognition, because the finer details carried by the motion information have not yet been mined. To obtain better predictions, the motion information of the local context in the video must be captured, together with the motion information of the global context.
Today's video action recognition techniques rely entirely on deep learning, among which a classic work is the two-stream convolutional neural network. The two-stream convolutional network borrows from the two visual pathways that process information in the brain: the ventral pathway ("what" pathway) handles spatial information such as the shape and color of objects, while the dorsal pathway ("where" pathway) handles information related to motion and position. Although this method improves on single-stream methods by explicitly capturing local temporal motion, the video-level prediction is obtained by averaging the prediction scores of sampled clips, so medium- and long-term temporal information is still lost from the learned features. There is therefore still much room to improve two-stream video recognition methods.
Summary of the invention
The purpose of the present invention is to solve the above drawbacks in the prior art by providing a multi-modal action recognition method based on deep neural networks. The method adds the human skeleton as an additional modality on top of the two-stream convolutional network. Human pose estimation is comparatively tractable (the key points of the human skeleton are strongly correlated, so bottom-up and top-down cues can be combined for localization), and mature open frameworks such as AlphaPose are available. Introducing the skeleton into action recognition eliminates, on the one hand, the interference of irrelevant background; on the other hand, it finely depicts how the position of each key point of the moving human body changes across the frame sequence, which benefits recognition of the action. The method uses a deep neural network with a multi-branch structure to perform multi-modal action recognition, in which the image branch handles spatial information such as the shape and color of objects, the optical flow branch handles information related to motion and position, and the skeleton branch achieves a fine depiction of the action by processing the path integral features of the frame sequence. In addition, the invention introduces a pooling method based on an attention mechanism in the image branch, so that the image branch can automatically place its focus on the regions of interest most closely tied to the action category, further increasing the accuracy of the action recognition method.
The purpose of the present invention can be achieved by adopting the following technical scheme:
A multi-modal action recognition method based on deep neural networks, the action recognition method comprising the following steps:
S1. Collect public databases and convert every frame of the video data into an RGB picture set. The naming rule uses video name + time + action id as the filename for separating the data. The data are split into a training set and a test set in a ratio of 3:1, where the action id covers the following six basic actions: walking, running, waving, bending over, jumping, standing.
S2. Unify the resolution of the data set obtained in step S1.
S3. Compress the image data set processed in step S2 to reduce the amount of computation, i.e., compress the pixel information of every video frame using the image discrete cosine transform.
S4. For the video data processed in step S3, along the time dimension, delete video frames whose time interval falls within the interval threshold or whose picture similarity exceeds the similarity threshold.
S5. Extract the optical flow information of N consecutive video frames from the data processed in step S4, where N is a positive integer greater than or equal to 10.
S6. Using an open-source pose estimation algorithm such as AlphaPose, extract the human skeleton from the video frame by frame to obtain a frame sequence, and compute the path integral features of that frame sequence.
S7. Take the optical flow information extracted in step S5, the human skeleton path integral features extracted in step S6, and the several video images processed in step S4 as the input of the deep neural network. At the low level, the deep neural network has three branches: a convolutional neural network for extracting temporal features, a convolutional neural network for extracting spatial features, and a fully-connected network for processing the skeleton path integral features. At the high level, the three low-level branches are merged into one branch through feature fusion, and the category id of the video action is predicted by a softmax activation function.
Further, the databases collected in step S1 mainly include the KTH human behavior database and the UCF Sports database.
Further, step S2 unifies the video image resolution to 120*90.
Further, step S3 performs a discrete cosine transform on every frame of the video data, thresholds the transformed DCT coefficients by zeroing out the coefficients smaller than a certain threshold so that the compression ratio is 10:1, and then performs an inverse DCT operation to obtain the single-frame images of the compressed video data.
Further, step S4 deletes, along the time dimension, video frames spaced within 500ms whose similarity exceeds 70%, reducing redundancy. The interval threshold for the time interval ranges from 400ms to 1000ms, with a typical value of 500ms; the similarity threshold for the picture similarity ranges from 0.5 to 0.9, with a typical value of 0.7.
Further, step S5 extracts the optical flow information of 10 consecutive video frames from the processed data, mainly by using the Lucas-Kanade algorithm, which applies the least-squares principle over all pixels in a neighborhood to solve the basic optical flow equations, finally yielding the required optical flow information.
Further, step S6 extracts the human skeleton from the video frame by frame using an open-source pose estimation algorithm such as AlphaPose to obtain a frame sequence, and computes the path integral features of that frame sequence.
Further, step S7 inputs the optical flow information extracted in step S5, the human skeleton path integral features extracted in step S6, and the several video images processed in step S4 into the deep neural network, whose network structure is as follows:
At the low level, the deep neural network has three branches, namely the image branch, the optical flow branch, and the skeleton branch, corresponding respectively to the inputs of the three modalities; at the high level, the three low-level branches are merged into one branch through feature fusion, wherein:
The image branch uses a convolutional neural network, connected in sequence from input layer to output layer as: convolutional layer conv1, pooling layer pooling1, convolutional layer conv2, pooling layer pooling2, convolutional layer conv3, convolutional layer conv4, convolutional layer conv5, attention pooling layer attention pooling, fully-connected layer fc6, fully-connected layer fc7, fully-connected layer fc8, data fusion layer fusion, and loss function layer loss;
The optical flow branch uses a convolutional neural network, connected in sequence from input layer to output layer as: convolutional layer conv1, pooling layer pooling1, convolutional layer conv2, pooling layer pooling2, convolutional layer conv3, convolutional layer conv4, convolutional layer conv5, pooling layer pooling5, fully-connected layer fc6, fully-connected layer fc7, fully-connected layer fc8, data fusion layer fusion, and loss function layer loss;
The skeleton branch uses a fully-connected network, consisting in sequence from input layer to output layer of fully-connected layer fc1, fully-connected layer fc2, data fusion layer fusion, and loss function layer loss.
Further, the data fusion layer fusion classifies the input video action through a softmax activation function and optimizes the parameters of the network by minimizing a classification loss function.
Further, in step S7 the attention pooling layer of the image branch introduces an attention mechanism. The features after convolution are used to construct two groups of learnable weight vectors: the bottom-up saliency weight vector b and the top-down attention weight vector a. Matrix operations are then used to apply the bottom-up saliency weighting and the top-down attention weighting to the feature projections, and the responses of the two are finally fused to obtain the final result. Suppose the feature to be pooled is X with X ∈ R^(n×f) and a, b ∈ R^(f×1), where n is the spatial size of the feature map to be pooled and f is its number of channels. Xb ∈ R^(n×1) represents the projection map obtained after feature X has undergone the bottom-up saliency weighting; this projection map is independent of the specific class and has a thickness of 1. Xa represents the projection map obtained after feature X has undergone the top-down attention weighting. Since different classes should have different attention weight vectors a, let the number of categories be K; then the attention weight matrix over all categories is A ∈ R^(f×K), and the top-down attention projection maps are given by XA ∈ R^(n×K). Finally, the attention projection Xa_k of each particular category k is multiplied element-wise with the saliency projection Xb, and the products are summed over the n spatial positions, yielding the attention-weighted feature response of that category.
Further, the data fusion layer fusion classifies the input video action through a softmax activation function and optimizes the parameters of the network by minimizing a classification loss function. Training of the network model is not restricted to a specific training framework; the Caffe, MXNet, Torch, or TensorFlow frameworks, among others, may be used.
Compared with the prior art, the present invention has the following advantages and effects:
(1) The multi-modal action recognition method based on deep neural networks disclosed by the invention uses a deep neural network with a multi-branch structure, in which the image branch handles spatial information such as the shape and color of objects, the optical flow branch handles information related to motion and position, and the skeleton branch achieves a fine depiction of the action by processing the path integral features of the frame sequence.
(2) The multi-modal action recognition method based on deep neural networks disclosed by the invention first reduces the amount of network computation through preprocessing, greatly shortening the running time, and it comprehensively exploits multi-modal information such as the video images, optical flow maps, and the human skeleton, significantly improving the accuracy of video action recognition.
(3) The multi-modal action recognition method based on deep neural networks disclosed by the invention introduces an attention-weighted pooling operation in the pooling layer of the image branch. During training it learns a weight for each pooling unit on its own: pooling units with larger weights correspond to abstract features closely tied to the action, while pooling units with smaller weights correspond to other features that should be ignored or that would interfere with action recognition. After passing through the attention-based pooling structure, features unrelated to the action category are ignored, while features closely tied to the action are "amplified", improving the accuracy and precision of action recognition.
Brief description of the drawings
Fig. 1 is a schematic diagram of the model of the multi-modal action recognition method based on deep neural networks disclosed in the present invention;
Fig. 2 is a schematic diagram of the computation of the attention-based pooling structure proposed by the present invention;
Fig. 3 is a flow chart of the multi-modal action recognition method based on deep neural networks disclosed in the present invention.
Specific embodiments
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
Embodiment
As shown in Fig. 1, this embodiment discloses a multi-modal action recognition method based on deep neural networks. At the low level, the deep neural network used in this embodiment has three branches: a convolutional neural network for extracting temporal features, a convolutional neural network for extracting spatial features, and a fully-connected network for processing the skeleton path integral features. At the high level, the three branches are merged into one branch through feature fusion, and the category id of the video action is predicted by a softmax activation function. In the image branch, a pooling structure based on an attention mechanism is introduced; without changing the existing network infrastructure, it helps the network focus on the features that favor recognizing the action, thereby reducing the interference of irrelevant features, improving the performance of the existing network, and allowing the video human action recognition system to be applied to engineering more effectively.
As an embodiment of the present invention, complete training data improves the training precision of the model. In addition, preprocessing and compressing the data further reduces the interference of redundant and irrelevant information and reduces the computation of the model, thereby shortening the model training time and improving the training precision. Accordingly, as an embodiment of the present invention, the multi-modal action recognition method based on deep neural networks proceeds as follows:
S1. Collection of training data
Public databases are collected, mainly including the following: the KTH human behavior database and the UCF Sports database. Every frame of the video data is converted into an RGB picture set, with the naming rule video name + time + action id as the filename for separating the data. The data are split into a training set and a test set in a ratio of 3:1, where the action id covers six basic actions: walking, running, waving, bending over, jumping, standing.
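As a concrete illustration of this step, the following is a minimal sketch of the naming convention and the 3:1 split, assuming the frames have already been exported as PNG files named video_time_actionid.png (the directory layout, the separator, and the fixed random seed are illustrative assumptions, not part of the invention):

```python
import random
from pathlib import Path

ACTIONS = ["walking", "running", "waving", "bending_over", "jumping", "standing"]

def split_dataset(frame_dir, ratio=3):
    """Split frame files named '<video>_<time>_<action-id>.png' into
    training and test sets at a 3:1 ratio, as described in step S1."""
    files = sorted(Path(frame_dir).glob("*.png"))
    random.seed(0)                      # fixed seed for a reproducible split
    random.shuffle(files)
    cut = len(files) * ratio // (ratio + 1)
    return files[:cut], files[cut:]

train_set, test_set = split_dataset("data/frames")
```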
S2. The data set obtained in step S1 is normalized, i.e., its resolution is unified: the picture of every frame is compressed to a uniform resolution of 120*90, which, while preserving the information integrity of the image as far as possible, reduces the computation of the convolutional neural network model and improves the recognition speed.
S3. The image data set processed in step S2 is compressed to reduce the amount of computation: the pixel information of every video frame is compressed using the image discrete cosine transform (DCT) at a compression ratio of 10:1, reducing the amount of information handled during initialization. A discrete cosine transform is performed on the original image, the transformed DCT coefficients are thresholded by zeroing the coefficients smaller than the threshold, the image is compressed and quantized, and an inverse DCT operation then yields the final compressed image.
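A minimal sketch of this per-frame DCT compression, assuming OpenCV and NumPy and a single-channel input (cv2.dct requires float32 input with even side lengths, which the 120*90 frames from step S2 satisfy); keeping the largest 10% of coefficients by magnitude is one illustrative way to realize the stated 10:1 ratio:

```python
import cv2
import numpy as np

def dct_compress(gray, keep_ratio=0.1):
    """Zero the small DCT coefficients of one frame (approx. 10:1, step S3)."""
    coeffs = cv2.dct(gray.astype(np.float32))
    # Threshold: keep only the largest `keep_ratio` fraction of coefficients.
    thresh = np.quantile(np.abs(coeffs), 1.0 - keep_ratio)
    coeffs[np.abs(coeffs) < thresh] = 0.0
    # Inverse DCT recovers the compressed single-frame image.
    return np.clip(cv2.idct(coeffs), 0, 255).astype(np.uint8)
```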
S4. For the video data processed in step S3, along the time dimension, video frames spaced within 500ms or with a picture similarity above 0.7 are deleted, reducing redundancy.
The method for calculating picture similarity proceeds as follows:
S41. Scale the pictures: compress each picture to a common size of 8*8, i.e., 64 pixel values;
S42. Simplify the colors: convert the picture to a grayscale image;
S43. Compute the average: compute the average of the pixel values over all pixels of the grayscale image;
S44. Compare pixel gray values: traverse each pixel of the grayscale image and compare it with the average computed in the previous step, recording 1 if it is greater than the average and 0 otherwise;
S45. Obtain the 64-bit image fingerprint;
S46. Compute the Hamming distance between the image fingerprints of the two pictures and use it as the measure of picture similarity.
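A minimal sketch of this fingerprint comparison (steps S41-S46), assuming OpenCV and NumPy; converting the Hamming distance into a [0, 1] similarity score is an added assumption so that the 0.7 threshold of step S4 can be applied directly:

```python
import cv2
import numpy as np

def average_hash(img):
    """64-bit average-hash fingerprint of a BGR image (steps S41-S45)."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # S42: grayscale
    small = cv2.resize(gray, (8, 8))               # S41: 8*8, 64 pixels
    return (small > small.mean()).flatten()        # S43-S45: one bit per pixel

def picture_similarity(img_a, img_b):
    """Similarity in [0, 1] derived from the Hamming distance (S46)."""
    dist = np.count_nonzero(average_hash(img_a) != average_hash(img_b))
    return 1.0 - dist / 64.0
```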
S5. The optical flow method mainly exploits the variation of pixels in the time domain of an image sequence and the correlation between consecutive frames to find the correspondence between the previous frame and the current frame, thereby computing the motion information of objects between consecutive frames. The Lucas-Kanade method is a widely used differential method of optical flow estimation; using the least-squares principle, it solves the basic optical flow equations over all pixels in a neighborhood. Compared with common point-by-point methods, the Lucas-Kanade algorithm is less sensitive to image noise. The bidirectional optical flow information of 10 consecutive video frames is therefore extracted from the video frame data processed in step S4 using the Lucas-Kanade algorithm. The Lucas-Kanade algorithm is the method proposed in: Lucas B and Kanade T, "An Iterative Image Registration Technique with an Application to Stereo Vision", Proc. of the 7th International Joint Conference on Artificial Intelligence (IJCAI), pp. 674-679. It is implemented in OpenCV, so this implementation extracts the optical flow information using the Lucas-Kanade routine in OpenCV.
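As an illustration, the following is a minimal sketch of this step using OpenCV's pyramidal Lucas-Kanade routine cv2.calcOpticalFlowPyrLK; the corner-detection parameters are illustrative assumptions, and the "bidirectional" flow of step S5 is obtained by running the same routine with the frame order reversed as well:

```python
import cv2

def lk_flow(prev_gray, next_gray, max_corners=100):
    """Sparse Lucas-Kanade flow between two consecutive grayscale frames."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=5)
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    good = status.ravel() == 1          # keep only successfully tracked points
    return pts[good].reshape(-1, 2), nxt[good].reshape(-1, 2)

def clip_flow(frames):
    """Forward flow over a clip of 10 consecutive frames (step S5);
    reversing `frames` yields the backward half of the bidirectional flow."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    return [lk_flow(a, b) for a, b in zip(grays[:-1], grays[1:])]
```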
S6. Path integral features, obtained through iterated integrals of a path, can extract rich dynamic information from the path, such as the displacement and curvature that characterize it. Using a pose estimation algorithm such as AlphaPose, the human skeleton is extracted frame by frame from the video data processed in step S4, giving a skeleton time series. Denote the number of video frames by N and the number of key points by K (taken as 15). Each key point has two coordinates, so the frame sequence is a path of dimension d = 2K and length N. It can be written as P_d = {X_1, X_2, ..., X_N}, where each X_i is a 2K-dimensional vector. The discrete path P_d is a sampling of the underlying continuous key-point path P_t: [0, T] → R^d. For P_t, the k-th order iterated integral of the path is defined as follows:

I^(i1,...,ik) = ∫_{0 < t1 < t2 < ... < tk < T} dX^(i1)_t1 dX^(i2)_t2 ... dX^(ik)_tk

where each index i1, ..., ik ranges over the d coordinates of the path. The path integral feature is then the collection of the iterated integrals of all orders, an infinite-dimensional vector; the 0th-order iterated integral is defined as 1. In engineering practice, the iterated integrals of the first m orders generally portray the dynamic characteristics of the path well enough, so the path integral feature truncated at order m is taken as:

S(X)|_m = {1, I_1, I_2, ..., I_m}

In practice, P_t is not available, only P_d; the path integrals can then be computed by tensor algebra.
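A minimal sketch of computing the truncated path integral feature from a skeleton sequence. It assumes the third-party iisignature package, whose sig function returns the iterated-integral terms of orders 1 through m for a discrete path; prepending the constant 1 matches the definition of S(X)|_m above:

```python
import numpy as np
import iisignature  # pip install iisignature (assumed available)

def skeleton_signature(keypoints, m=2):
    """Truncated path integral feature S(X)|_m of a skeleton sequence.

    keypoints: array of shape (N, K, 2) -- N frames, K key points (K=15),
    two coordinates each -- flattened to an N-step path in R^(2K) as in
    step S6. For d = 30 and m = 2 this yields 1 + 30 + 900 = 931 values.
    """
    n_frames = keypoints.shape[0]
    path = keypoints.reshape(n_frames, -1).astype(np.float64)  # (N, 2K)
    sig = iisignature.sig(path, m)       # orders 1..m, concatenated
    return np.concatenate(([1.0], sig))  # prepend the 0th-order term
```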
With the data constructed by the downloading and preprocessing of the above steps and divided into a training set and a test set in a ratio of 3:1, the neural network model for action recognition is constructed as follows.
S7. The optical flow information extracted in step S5, the human skeleton path integral features extracted in step S6, and the several video images processed in step S4 are input into the deep neural network. At the low level, the network has three branches, namely the image branch, the optical flow branch, and the skeleton branch, corresponding respectively to the inputs of the three modalities; at the high level, the three branches are merged into one branch through feature fusion.
The image branch uses a convolutional neural network, connected in sequence from input layer to output layer as: convolutional layer conv1, pooling layer pooling1, convolutional layer conv2, pooling layer pooling2, convolutional layer conv3, convolutional layer conv4, convolutional layer conv5, attention pooling layer attention pooling, fully-connected layer fc6, fully-connected layer fc7, fully-connected layer fc8, data fusion layer fusion, and loss function layer loss;
The optical flow branch uses a convolutional neural network, connected in sequence from input layer to output layer as: convolutional layer conv1, pooling layer pooling1, convolutional layer conv2, pooling layer pooling2, convolutional layer conv3, convolutional layer conv4, convolutional layer conv5, pooling layer pooling5, fully-connected layer fc6, fully-connected layer fc7, fully-connected layer fc8, data fusion layer fusion, and loss function layer loss;
The skeleton branch uses a fully-connected network, consisting in sequence from input layer to output layer of fully-connected layer fc1, fully-connected layer fc2, data fusion layer fusion, and loss function layer loss.
The overall network structure is shown in Fig. 1. In the multi-branch neural network, the image branch captures the spatial dependencies in the video, the optical flow branch captures the recurring motion present at each spatial position in the video, and the skeleton branch finely depicts the spatio-temporal changes of the positions of the human key points during the action. The three branches each pass through their own feature learning network and are merged at the data fusion layer fusion, yielding the final abstract spatio-temporal features relevant to action recognition. The features of the fusion layer are passed through a softmax activation to predict the action category.
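To make the branch-and-fusion topology concrete, the following is a minimal PyTorch sketch; the patent itself is framework-agnostic (naming Caffe, MXNet, Torch, and TensorFlow), all layer widths here are illustrative assumptions, the small conv stacks stand in for the conv1-fc8 chains listed above, and the attention pooling of the image branch is sketched separately below:

```python
import torch
import torch.nn as nn

class ThreeBranchNet(nn.Module):
    """Illustrative image / optical flow / skeleton fusion network."""
    def __init__(self, sig_dim=931, num_classes=6):
        super().__init__()
        def conv_stack(in_ch):  # stands in for conv1..conv5 and fc6..fc8
            return nn.Sequential(
                nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(128, 256), nn.ReLU())
        self.image = conv_stack(3)      # RGB frames
        self.flow = conv_stack(20)      # x/y flow of 10 frames, stacked
        self.skeleton = nn.Sequential(  # fc1, fc2 on path integral features
            nn.Linear(sig_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU())
        self.fusion = nn.Linear(256 * 3, num_classes)

    def forward(self, img, flow, sig):
        fused = torch.cat([self.image(img), self.flow(flow),
                           self.skeleton(sig)], dim=1)  # feature fusion layer
        return self.fusion(fused)  # class logits

# Cross-entropy combines the softmax activation with the classification loss.
criterion = nn.CrossEntropyLoss()
```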
In the image branch, attention pooling is a pooling computation structure based on the attention mechanism, as shown in Fig. 2. The features after convolution are used to construct two groups of learnable weight vectors: the bottom-up saliency weight vector b and the top-down attention weight vector a. Matrix operations are then used to apply the bottom-up saliency weighting and the top-down attention weighting to the feature projections, and the responses of the two are finally fused to obtain the final result. Suppose the feature to be pooled is X with X ∈ R^(n×f) and a, b ∈ R^(f×1), where n is the spatial size of the feature map to be pooled and f is its number of channels. Xb ∈ R^(n×1) represents the projection map obtained after feature X has undergone the bottom-up saliency weighting; as shown in Fig. 2, this projection map is independent of the specific class and has a thickness of 1. Xa represents the projection map obtained after feature X has undergone the top-down attention weighting. Since different classes should have different attention weight vectors a, let the number of categories be K; then the attention weight matrix over all categories is A ∈ R^(f×K), and the top-down attention projection maps are given by XA ∈ R^(n×K). As shown in Fig. 2, the final attention projection Xa_k of each particular category k is first multiplied element-wise with the saliency projection Xb, and the products are then summed over the n spatial positions, yielding the attention-weighted feature response of that category.
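A minimal PyTorch sketch of this attention pooling computation, assuming the convolutional feature map has already been flattened to shape (batch, n, f); random parameter initialization is an illustrative assumption:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Per-class response: element-wise product of Xa_k and Xb, summed over n."""
    def __init__(self, f, k):
        super().__init__()
        self.b = nn.Parameter(torch.randn(f, 1))  # bottom-up saliency vector b
        self.A = nn.Parameter(torch.randn(f, k))  # top-down weights, one column per class

    def forward(self, x):
        # x: (batch, n, f) -- n spatial positions, f channels, as in the text.
        saliency = x @ self.b     # Xb: (batch, n, 1), class-agnostic
        attention = x @ self.A    # XA: (batch, n, K), one map per class
        # Fuse the two responses: multiply element-wise, sum over positions.
        return (attention * saliency).sum(dim=1)  # (batch, K)
```

The class-agnostic saliency map Xb broadcasts across all K attention maps, which realizes the fusion of the bottom-up and top-down responses described above.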
Through multi-modal fusion and the attention mechanism, the multi-modal action recognition method based on deep neural networks in the present invention captures spatio-temporal features relevant to action recognition, thereby improving the recognition accuracy of the network.
The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited by the above embodiment. Any other change, modification, substitution, combination, or simplification made without departing from the spirit and principles of the present invention shall be an equivalent replacement and is included within the protection scope of the present invention.