CN114387539A - Behavior identification method based on SimAM attention mechanism - Google Patents
- Publication number
- CN114387539A (application number CN202111572410.XA)
- Authority
- CN
- China
- Prior art keywords
- feature map
- attention
- flow
- output
- channel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Probability & Statistics with Applications (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a behavior recognition method based on the SimAM attention mechanism. The method makes full use of the parameter-free nature of the SimAM attention module and combines it organically with a two-stream network model, so that recognition accuracy is improved while network parameters and training difficulty are reduced. Different fusion modes are adopted adaptively according to the high-level and low-level feature content (i.e., the resolution) of the input video, which improves the perception of detail and yields a high-performance behavior recognition algorithm. The two-stream network model is thereby optimized more effectively, improving both the engineering practicality and the accuracy of the behavior recognition algorithm of this scheme.
Description
Technical Field
The invention belongs to a computer vision technology in the field of artificial intelligence, and particularly relates to a behavior identification method based on a SimAM attention mechanism.
Background Art
In the era of internet big data, more and more videos are being shared, and methods for quickly extracting information from massive video resources have extremely high research and application value. Human behavior recognition in video has gradually become a major research hotspot in computer vision and is widely applied in public video surveillance, human-computer interaction, scientific cognition, medical rehabilitation and other fields. In recent years, with the increasing level of computing power, deep learning has developed rapidly, and behavior recognition algorithms based on deep learning are gradually emerging.
At present, behavior recognition algorithms based on deep learning are mainly classified into three categories:
(1) Single-stream network model methods. The three-dimensional convolutional neural network (3D CNN) proposed by Shuiwang Ji et al. is widely used in this class of methods. Its network structure comprises a hardwired layer, three-dimensional convolutional layers, two pooling layers and a fully connected layer.
(2) Two-stream network model methods. Inspired by the two-streams hypothesis of neuroscience, Simonyan et al. first proposed the two-stream network model in 2014. The two-streams hypothesis holds that the visual cortex contains a ventral pathway, which responds to the shape and color of a target, and a dorsal pathway, which responds to the spatial displacement caused by the target's motion. The two-stream architecture mimics the visual cortex by establishing a temporal information channel and a spatial information channel: two mutually independent, parallel CNNs extract the temporal and spatial features of the video, and the features are finally fused.
(3) Multi-stream network model methods. These extend the two-stream model by adding further CNNs on top of it to extract additional kinds of features.
Throughout the development of deep learning, significant breakthroughs and discoveries have often been grounded in neuroscience theory. In visual neuroscience, the most informative neurons are usually those whose firing patterns differ from those of the surrounding neurons. Moreover, an active neuron may also suppress the activity of surrounding neurons, a phenomenon known as spatial suppression. In visual processing, neurons that exhibit significant spatial suppression should be given higher priority, and the simplest way to find them is to measure the linear separability between a target neuron and the other neurons. Based on this spatial suppression effect, Lingxiao Yang et al. recently proposed SimAM (A Simple, Parameter-Free Attention Module for Convolutional Neural Networks), which forms the attention of each neuron by computing an energy function that measures the neuron's importance.
The two-stream network model mentioned above has good generalization and extensibility, but it has since been found that fusing the recognition results of the temporal stream and the spatial stream by late fusion gives the model poor perception of detail. Many improved models have therefore been proposed on this basis, for example adding an attention module to the two-stream network to form a "two-stream network model + attention mechanism" structure; existing attention mechanisms include channel attention, spatial attention and so on. Combining the two-stream model with an attention mechanism effectively improves recognition accuracy, but it also greatly increases the complexity and training difficulty of the model, which limits its further application.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, on the basis of existing artificial intelligence and machine vision technology, a behavior identification method based on the SimAM attention mechanism that addresses the shortcomings of the existing behavior recognition algorithms described in the background.
The principle is as follows:
Video features are extracted with the existing two-stream network model and further strengthened by combining it with the lightweight SimAM attention mechanism; different feature fusion modes are adopted according to the high-level and low-level feature content of the video, which improves recognition accuracy, reduces network parameters and training difficulty, and improves the perception of detail.
A behavior identification method based on a SimAM attention mechanism comprises the following steps:
sampling an input video and randomly extracting a sample feature map, the sample feature map comprising a single-frame first video image and a second video image of several consecutive frames, and scaling and cropping the sample feature map;
spatial stream feature extraction and temporal stream feature extraction: decomposing the first video image into its three primary color channels to obtain a three-primary-color channel image, and computing a stacked optical flow for the second video image; inputting the single-frame three-primary-color channel image into the spatial stream CNN to extract a spatial stream feature map, and inputting the stacked optical flow into the temporal stream CNN to extract a temporal stream feature map;
computing the SimAM attention of the spatial stream CNN and the temporal stream CNN respectively, and fusing it with the corresponding spatial stream feature map and temporal stream feature map to obtain a spatial stream attention-fused feature map and a temporal stream attention-fused feature map;
early fusion and output of features: for an input video containing more high-level features (single-frame input video resolution greater than 1080p), the features are fused in an early fusion mode: the fused feature of the spatial stream attention-fused feature map and the temporal stream attention-fused feature map is input into a fully connected layer for classification, and the result is output through a softmax function;
late fusion and output of results: for an input video containing more low-level features (single-frame input video resolution less than 1080p), the features are fused in a late fusion mode: in the spatial stream CNN the spatial stream attention-fused feature map is input into a fully connected layer for classification, in the temporal stream CNN the temporal stream attention-fused feature map is input into a fully connected layer for classification, the two output values are fused by averaging, and the averaged value is finally input into a softmax function to output the result.
Low-level features refer to features such as edges, contours, positions and sizes; high-level features refer to features such as textures. In general, the higher the resolution, the more high-level features the video contains. A minimal sketch of this resolution-based fusion switch is given below.
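The following Python sketch illustrates one possible realization of the switch between early and late fusion described above. It is a minimal sketch under stated assumptions: the names classify_behavior, dual_stream_logits, spatial_logits and temporal_logits are illustrative and not taken from the patent, and the 1080p criterion is interpreted here as a check on the frame height.

```python
# Hedged sketch of the resolution-based fusion switch; all names are illustrative.
import torch
import torch.nn.functional as F

def classify_behavior(frame_height: int,
                      dual_stream_logits: torch.Tensor,
                      spatial_logits: torch.Tensor,
                      temporal_logits: torch.Tensor) -> torch.Tensor:
    """Return class probabilities using early or late fusion depending on resolution."""
    if frame_height > 1080:
        # More high-level features: early fusion, classify the fused dual-stream feature.
        return F.softmax(dual_stream_logits, dim=-1)
    # More low-level features: late fusion, average the two stream outputs before softmax.
    return F.softmax((spatial_logits + temporal_logits) / 2.0, dim=-1)
```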
Further, the specific steps of scaling and cropping the sample feature map are: scale the sample feature map to N × N and then randomly crop it to N′ × N′, where N is the pixel size of the image and N′ is the pixel size of the image after random cropping.
Further, the three primary color channel image acquisition method specifically includes the following steps:
decomposing the scaled and cropped first video image into red, green and blue channels to obtain images X_z(x, y) under the three channels,
where z denotes the z-th channel, z is an integer with 1 ≤ z ≤ 3, and the 1st, 2nd and 3rd channels are the red, green and blue channels respectively;
x is the horizontal coordinate and y the vertical coordinate of a pixel in the image.
Further, computing the stacked optical flow specifically comprises the following steps:
the optical flow is regarded as a set of displacement vector fields between consecutive frames t and t+1; for a point (u, v) in the t-th frame, the optical flow of the t-th frame is denoted I_τ and is computed as follows:
where u ∈ [1, w], v ∈ [1, h], k ∈ [1, L]; w is the width of the second video image, h is the height of the second video image, and L is the number of frames of the second video image;
finally, a stacked optical flow of size N′ × N′ × 2L is obtained.
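A reconstruction of the stacked optical flow referenced above, following the standard two-stream formulation of stacked displacement fields (Simonyan & Zisserman) with the symbols defined in this description — given as an illustration rather than the exact expression of the original filing — is:

$$ I_\tau(u, v, 2k-1) = d^{x}_{\tau+k-1}(u, v), \qquad I_\tau(u, v, 2k) = d^{y}_{\tau+k-1}(u, v), $$
$$ u \in [1, w], \quad v \in [1, h], \quad k \in [1, L], $$

where d^x_t and d^y_t denote the horizontal and vertical displacement fields between frames t and t+1; stacking the 2L channels and cropping to N′ × N′ gives the N′ × N′ × 2L input.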
Further, the extracting of the spatial flow feature map or the temporal flow feature map specifically includes the following steps:
step 1: perform a padding operation on the input three-primary-color channel image or the stacked optical flow, expanding it from N′ × N′ to (N′+7) × (N′+7) and filling the expanded portion with the value 0; convolve the three-primary-color channel image or the stacked optical flow with 96 convolution kernels of size 7 × 7 and a stride of 2 to generate a feature map;
step 2: and (3) linearly rectifying the characteristic diagram generated in the step 1 by using a ReLU, wherein the formula of a ReLU function is as follows:
ReLU(m)=max(0,m)
wherein m is an independent variable; max is the maximum of 0 and m;
step 3: perform a max pooling operation on the linearly rectified feature map, with a pooling size of 2 × 2;
step 4: convolve the feature map generated in step 3 with 256 convolution kernels of size 5 × 5, a stride of 2 and a pooling size of 2 × 2;
step 5: convolve the feature map generated in step 4 with 512 convolution kernels of size 3 × 3 and a stride of 1;
step 6: convolve the feature map generated in step 5 with 512 convolution kernels of size 3 × 3 and a stride of 1;
step 7: convolve the feature map generated in step 6 with 512 convolution kernels of size 3 × 3, a stride of 1 and a pooling size of 2 × 2;
finally, 512 spatial stream feature maps S_p are obtained and regarded as 512 channels, where p ∈ [1, 512] and each S_p has (N′/32) × (N′/32) neurons;
similarly, the 512 temporal stream feature maps S_q are regarded as 512 channels, where q ∈ [1, 512] and each S_q has (N′/32) × (N′/32) neurons. A sketch of this convolution stack in code is given below.
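For illustration, the five-stage convolution stack of steps 1-7 can be sketched in PyTorch as follows. This is a sketch under the stated kernel counts and strides, with padding values chosen so that an N′ = 224 input yields 7 × 7 × 512 maps; the exact padding of the original zero-filling step is an assumption, not a definitive implementation.

```python
# Hedged PyTorch sketch of the five-stage convolution stack (steps 1-7 above).
import torch.nn as nn

def make_stream_cnn(in_channels: int) -> nn.Sequential:
    """in_channels = 3 for the spatial stream (RGB), 2L for the temporal stream."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 96, kernel_size=7, stride=2, padding=3),  # step 1: 96 kernels, 7x7, stride 2
        nn.ReLU(inplace=True),                                           # step 2: linear rectification
        nn.MaxPool2d(2),                                                 # step 3: 2x2 max pooling
        nn.Conv2d(96, 256, kernel_size=5, stride=2, padding=2),          # step 4: 256 kernels, 5x5, stride 2
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),                                                 # step 4: 2x2 pooling
        nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1),         # step 5: 512 kernels, 3x3
        nn.ReLU(inplace=True),
        nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1),         # step 6: 512 kernels, 3x3
        nn.ReLU(inplace=True),
        nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1),         # step 7: 512 kernels, 3x3
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),                                                 # step 7: 2x2 pooling
    )

# With a 224 x 224 input this produces 512 feature maps of size 7 x 7 (= 224/32).
```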
Further, computing the SimAM attention of the spatial stream CNN and the temporal stream CNN respectively, and fusing it with the corresponding spatial stream feature map S_p and temporal stream feature map S_q to obtain the spatial stream attention-fused feature map S′_p and the temporal stream attention-fused feature map S′_q, specifically comprises the following steps:
step 1: SimAM attention calculation for spatial streams CNN and temporal streams CNN:
calculating an energy function e_r for each neuron:
where r denotes the target neuron in a single input channel; q_i denotes the other neurons in that channel, with i a serial number; w_r is the weight of the linear transformation; b_r is the bias of the linear transformation; (N′/32) × (N′/32) is the number of other neurons on the channel; γ is a variable; λ is a coefficient;
w_r is calculated using the following formula:
b_r is calculated using the following formula:
μ_r denotes the mean, with the formula:
σ_r² denotes the variance, with the formula:
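A reconstruction of these expressions, following the published SimAM formulation (Yang et al., 2021) rewritten with the symbols of this description (target neuron r, other neurons q_i, M the number of neurons on the channel) — given as an illustration rather than the exact equations of the original filing — is:

$$ e_r = \frac{1}{M-1}\sum_{i=1}^{M-1}\bigl(-1 - (w_r q_i + b_r)\bigr)^2 + \bigl(1 - (w_r\, r + b_r)\bigr)^2 + \lambda w_r^2 $$
$$ w_r = -\frac{2\,(r - \mu_r)}{(r - \mu_r)^2 + 2\sigma_r^2 + 2\lambda}, \qquad b_r = -\tfrac{1}{2}\,(r + \mu_r)\, w_r $$
$$ \mu_r = \frac{1}{M-1}\sum_{i=1}^{M-1} q_i, \qquad \sigma_r^2 = \frac{1}{M-1}\sum_{i=1}^{M-1}\bigl(q_i - \mu_r\bigr)^2 $$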
The smaller the energy of a neuron, the more it differs from the other neurons in the same channel and the more important it is; the reciprocal of the minimal neuron energy is taken as the weight of that neuron.
Each neuron corresponds to one neuron energy, and the energies of all neurons of a single channel form the energy matrix E of that channel. The attention weight matrix E′ of the channel is obtained by taking the reciprocal of each element of E and normalizing it with the sigmoid function, calculated as follows:
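Under the same reconstruction, the minimal energy and the channel attention weight matrix take the form:

$$ e_r^{*} = \frac{4\,(\sigma_r^2 + \lambda)}{(r - \mu_r)^2 + 2\sigma_r^2 + 2\lambda}, \qquad E' = \mathrm{sigmoid}\!\left(\frac{1}{E}\right), $$

where the sigmoid is applied element-wise to the reciprocal of each entry of the energy matrix E.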
step 2: the method for fusing the SimAM attention of the spatial stream CNN and the temporal stream CNN with the spatial stream feature map and the temporal stream feature map respectively comprises the following steps:
the fused characteristic diagram S' has the calculation formula as follows:
S′=S·E′
where S denotes one of the spatial stream feature map S_p or the temporal stream feature map S_q, and S′ denotes the corresponding spatial stream attention-fused feature map S′_p or temporal stream attention-fused feature map S′_q. A minimal code sketch of this parameter-free SimAM attention is given below.
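For illustration, a parameter-free SimAM module can be sketched in PyTorch as follows, following the implementation published with the SimAM paper; the class name SimAM and the default value of λ are assumptions, not taken from the patent.

```python
# Hedged sketch of a parameter-free SimAM module (per-channel energy-based attention).
import torch
import torch.nn as nn

class SimAM(nn.Module):
    def __init__(self, lambda_: float = 1e-4):
        super().__init__()
        self.lambda_ = lambda_  # the coefficient λ above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, H, W); attention is computed independently per channel
        n = x.shape[2] * x.shape[3] - 1                      # number of "other" neurons
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)    # squared distance to the channel mean
        v = d.sum(dim=(2, 3), keepdim=True) / n              # channel variance estimate
        e_inv = d / (4 * (v + self.lambda_)) + 0.5           # reciprocal of the minimal energy
        return x * torch.sigmoid(e_inv)                      # S' = S · E'
```

Applied to the 512-channel maps S_p and S_q, such a module would produce the attention-fused maps S′_p and S′_q without adding any learnable parameters.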
Further, obtaining the dual-stream attention-fused feature map by fusing the spatial stream attention-fused feature map and the temporal stream attention-fused feature map specifically comprises:
the spatial stream attention-fused feature map S′_p and the temporal stream attention-fused feature map S′_q are fused by concatenation (cascade fusion),
where the value at a position (i, j, d) of the feature map S′_p output by the spatial stream network (s denoting space) and the value at the corresponding position (i, j, d) of the feature map S′_q output by the temporal stream network (t denoting time) are interleaved at positions (i, j, 2d−1) and (i, j, 2d) of the concatenated feature map, d denoting the d-th feature map;
this gives a concatenated fusion feature map of size (N′/32) × (N′/32) × 1024, as written out below.
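A reconstruction of the concatenation, written in a common notation (x^s and x^t for the spatial and temporal attention-fused maps, y^cat for the concatenated map; the exact ordering of the two interleaved channels in the original filing may differ), is:

$$ y^{cat}_{i,j,2d-1} = x^{s}_{i,j,d}, \qquad y^{cat}_{i,j,2d} = x^{t}_{i,j,d}, \qquad d \in [1, 512], $$

so that the 512 spatial channels and the 512 temporal channels interleave into a map with 1024 channels.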
The concatenated fusion feature map is then passed through three successive convolution layers with 512 kernels of size 3 × 3, 1024 kernels of size 3 × 3 and 512 kernels of size 1 × 1 respectively; the last convolution layer reduces the dimensionality, finally giving the dual-stream attention-fused feature map S_d of size (N′/32) × (N′/32) × 512.
Further, the fully connected layer classification is:
output = f(w^T A + b)
where w is the weight vector, T denotes transposition, A is the input vector, b is the bias vector, and the output dimension is H × 1;
the input vector A is one of the dual-stream attention-fused feature map S_d, the temporal stream attention-fused feature map S′_q, or the spatial stream attention-fused feature map S′_p;
the obtained output is correspondingly one of the spatial stream network model output s_output, the temporal stream network model output t_output, or the dual-stream network model output d_output.
Further, solving the final behavior recognition result specifically comprises:
the softmax function expression is:
where output_i denotes the i-th element of output, output_k denotes the k-th element of output, P denotes probability, exp() denotes the exponential function with the natural constant e as base, and H is the number of elements;
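The standard softmax expression, written with the symbols defined above, is:

$$ P(\mathrm{output}_i) = \frac{\exp(\mathrm{output}_i)}{\sum_{k=1}^{H} \exp(\mathrm{output}_k)} $$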
the input of the softmax is d_output, or the average of s_output and t_output;
and the behavior classification label corresponding to the element with the highest probability in the H elements is the final behavior identification result.
Further, the spatial stream CNN and the temporal stream CNN are trained with a stochastic gradient descent algorithm;
the gradient of the output layer is calculated; the gradient of the k-th node of the output layer (layer K) is computed by the following formula:
where o_k denotes the output of the k-th node of layer K, and t_k denotes the label of the k-th node of layer K;
the gradient of the hidden layer is calculated; the gradient of the i-th node of hidden layer I is computed by the following formula:
where o_i denotes the output value of the i-th node of hidden layer I, the gradient term refers to the i-th node of the layer above hidden layer I (layer J), and w_ab denotes the element in row a and column b of the weight matrix; the parameters are updated according to the calculated gradients.
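The gradient expressions follow standard backpropagation; one common form, assuming sigmoid activations and a squared-error loss (an assumption — the original filing may use a different loss or activation), is:

$$ \delta^{K}_{k} = (o_k - t_k)\, o_k\, (1 - o_k), \qquad \delta^{I}_{i} = o_i\, (1 - o_i) \sum_{b} w_{ib}\, \delta^{J}_{b}, $$

where δ denotes the error term (gradient) of a node; the weights are then updated as w ← w − η·δ·o with learning rate η.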
The main innovation points of the invention are as follows:
(1) the two-stream network model is combined with the lightweight SimAM attention mechanism;
(2) different feature fusion modes are adopted according to the high-level and low-level feature content of the video.
Compared with the prior art, the invention has the following advantages: the behavior recognition method based on the SimAM attention mechanism uses a CNN-M-1024 network in the spatial stream and the temporal stream respectively, which fully extracts the spatial and temporal information of the video; fused with the SimAM attention mechanism, it improves the accuracy of video behavior recognition without increasing the number of network parameters, achieving a balance between model accuracy and training complexity. The method adaptively adopts different fusion modes according to the high-level and low-level feature content of the input video, further improving the accuracy of the model.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention using an early fusion approach;
fig. 2 is a flowchart of an embodiment of the present invention using a late fusion mode.
Detailed Description
The technical scheme of the invention is explained in further detail below with reference to the accompanying drawings:
the present invention may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. In the drawings, components are exaggerated for clarity.
The detailed steps of the present invention are given below:
the behavior identification method based on the SimAM attention mechanism comprises the following steps:
pretreatment part (step 1 to step 2):
step 1), the input video is sampled using the public UCF-101 dataset: first, one frame is randomly extracted and denoted img_a, where a is a group label; then ten consecutive frames are randomly extracted and denoted img_bi, where i ∈ [1, 10] and b is a group label;
step 2), the eleven sampled frames are scaled and cropped in the same way: each image is first scaled to 256 × 256 and then randomly cropped to 224 × 224;
spatial stream feature extraction section (step 3 to step 10):
step 3), the input image img_a is decomposed into red, green and blue channels to obtain images X_z(x, y) under the three channels, where z denotes the z-th channel, z is an integer with 1 ≤ z ≤ 3, and the 1st, 2nd and 3rd channels are the red, green and blue channels respectively; x and y are the horizontal and vertical coordinates of a pixel in the image;
step 4), a padding operation is performed on X_z(x, y), expanding the image from 224 × 224 to 231 × 231, with the expanded portion filled with the value 0; each X_z is then convolved with 96 convolution kernels of size 7 × 7 and a stride of 2, the j-th convolution kernel being denoted conv1_j(m, n), where j ∈ [1, 96] and m, n ∈ [0, 6] are the row and column indices of an element of the kernel. The convolution process can be expressed by the following steps:
step 4.1), the convolution of a single channel is calculated, which can be expressed by the following formula:
where x ∈ [0, 112], y ∈ [0, 112], and the operation multiplies corresponding elements of the matrices;
step 4.2), the sum of the convolutions of the three channels is calculated, which can be expressed by the following formula:
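A reconstruction of the formulas of steps 4.1 and 4.2, using the symbols defined above (the per-channel result is written F_j^z here as an illustrative name, not a name from the original filing), is:

$$ F_j^{z}(x, y) = \sum_{m=0}^{6}\sum_{n=0}^{6} X_z(2x + m,\; 2y + n)\cdot \mathrm{conv1}_j(m, n) $$
$$ F_j(x, y) = \sum_{z=1}^{3} F_j^{z}(x, y) $$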
step 5), the generated feature map F_j(x, y) is linearly rectified, which can be expressed by the following formula:
F_j′(x, y) = ReLU(F_j(x, y))
wherein the formula for ReLU is:
ReLU(x)=max(0,x)
step 6), the linearly rectified feature map F_j′(x, y) is max-pooled with a pooling size of 2 × 2, finally generating a 56 × 56 feature map F_j″(x, y); the process can be expressed as follows:
F_j″(x, y) = max(F_j′(2x, 2y), F_j′(2x+1, 2y), F_j′(2x, 2y+1), F_j′(2x+1, 2y+1))
where x ∈ [0, 56], y ∈ [0, 56];
step 7), the feature map is convolved with 256 kernels of size 5 × 5, a stride of 2 and a pooling size of 2 × 2, giving a 14 × 14 × 256 feature map; the convolution process is the same as steps 4 to 6;
step 8), the feature map is convolved with 512 kernels of size 3 × 3 and a stride of 1, giving a 14 × 14 × 512 feature map; the convolution process is the same as step 4;
step 9), the feature map is convolved with 512 kernels of size 3 × 3 and a stride of 1, giving a 14 × 14 × 512 feature map; the convolution process is the same as step 4;
step 10), the feature map is convolved with 512 kernels of size 3 × 3, a stride of 1 and a pooling size of 2 × 2, finally giving 512 feature maps of size 7 × 7; the convolution process is the same as steps 4 to 6.
Spatial flow SimAM part (step 11 to step 13):
step 11), the 512 spatial stream feature maps of the spatial stream network obtained in step 10 are regarded as 512 channels and denoted S_p, where p ∈ [1, 512]; each S_p has 7 × 7 = 49 neurons; the energy function e_r of each neuron is calculated by the following formula:
where r denotes the target neuron in a single input channel; q_i denotes the other neurons in that channel, with i a serial number; w_r is the weight of the linear transformation; b_r is the bias of the linear transformation; 49 is the number of other neurons on the channel; γ is a variable; λ is a coefficient.
w_r is calculated using the following formula:
b_r is calculated using the following formula:
μ_r denotes the mean, with the formula:
σ_r² denotes the variance, with the formula:
The smaller the energy of a neuron, the more it differs from the other neurons in the same channel and the more important it is, so the reciprocal of the minimal neuron energy is used to represent the weight of the neuron.
Step 12), the energies of all neurons of a single channel form the energy matrix E of that channel; the attention weight matrix E′ of the channel is obtained by taking the reciprocal of each element of E and normalizing it with the sigmoid function, calculated by the following formula:
step 13), the feature map is fused with the channel attention weights to compute the fused feature map S′_p, with the formula:
S′_p = S_p · E′,  p ∈ [1, 512]
time stream feature extraction section (step 14 to step 15):
step 14), the stacked optical flow is calculated; dense optical flow can be viewed as a set of displacement vector fields between consecutive frames t and t+1; for a point (u, v) in the t-th frame, the optical flow of the t-th frame is denoted I_τ and is computed as follows:
where u ∈ [1, w], v ∈ [1, h], k ∈ [1, L]; w is the width of the second video image, h is the height of the second video image, and L is the number of frames of the second video image.
The number of frames L is taken as 10, which finally gives a stacked optical flow input of size 224 × 224 × 20.
Step 15), the stacked optical flow is convolved five times; the convolution steps are the same as steps 4 to 10;
time-flow SimAM part (step 16 to step 17):
step 16), the attention in the temporal stream network model is calculated, the same as steps 11 to 12;
step 17), the feature map is fused with the channel attention weights, the same as step 13;
a feature fusion and output part (step 18 to step 23):
if the resolution of the input video in any dimension is greater than 1080p, steps 18 to 20 are adopted;
feature early fusion part (step 18 to step 19):
step 18), the value at a position (i, j, d) of the feature map output by the spatial stream network (s denoting space) and the value at the corresponding position (i, j, d) of the feature map output by the temporal stream network (t denoting time) are concatenated (cat denoting the cascade), which is calculated as follows:
the feature map obtained by cascade fusion has a size of 7 × 7 × 1024;
step 19), the concatenated feature map is passed through three successive convolution layers with 512 kernels of size 3 × 3, 1024 kernels of size 3 × 3 and 512 kernels of size 1 × 1 respectively; the last convolution layer reduces the dimensionality, and the final output feature map has a size of 7 × 7 × 512, consistent with the size of the original feature map; the convolution process is the same as steps 4 to 6;
full connection (step 20):
step 20), the 7 × 7 × 512 feature map with fused SimAM attention is fed into two fully connected layers containing 1024 and 101 neurons respectively to obtain the output d_output of the dual-stream network model, where d denotes the temporal-spatial dual stream; the calculation formula is
output = f(w^T A + b)
where w is the weight vector, T denotes transposition, A is the input vector, b is the bias vector, and the output dimension is 101 × 1.
The fused output value d_output is then fed into a softmax function to obtain the final classification result;
if the resolution of the input video in all dimensions is less than 1080p, steps 21 to 23 are adopted;
late fusion of results (step 21 to step 23):
step 21), the 7 × 7 × 512 spatial stream feature map with fused SimAM attention is fed into two fully connected layers containing 1024 and 101 neurons respectively to obtain the output s_output of the spatial stream network model, where s denotes space; same as step 20;
step 22), the 7 × 7 × 512 temporal stream feature map with fused SimAM attention is fed into two fully connected layers containing 1024 and 101 neurons respectively to obtain the output t_output of the temporal stream network model, where t denotes time; same as step 20;
step 23), the outputs of the spatial stream network model and the temporal stream network model are fused by averaging, (s_output + t_output)/2, to obtain the output value of the dual-stream network model, which is fed into the softmax function to obtain the final behavior recognition result.
Output section (step 24)
Step 24), the expression of the softmax function is:
where output_i denotes the i-th element of output, output_k denotes the k-th element of output, P denotes probability, and exp() denotes the exponential function with the natural constant e as base; the behavior classification label corresponding to the element with the highest probability among the 101 elements is the final behavior recognition result.
The neural network training part:
step 25), training a neural network by using a stochastic gradient descent algorithm (SGD);
step 25.1), the gradient of the output layer is calculated; the gradient of the k-th node of the output layer (layer K) is computed by the following formula:
where o_k denotes the output of the k-th node of layer K and t_k denotes the label of the k-th node of layer K.
Step 25.2), the gradient of the hidden layer is calculated; the gradient of the i-th node of hidden layer I is computed by the following formula:
where o_i denotes the output value of the i-th node of hidden layer I, the gradient term refers to the i-th node of the layer above hidden layer I (layer J), and w_ab denotes the element in row a and column b of the weight matrix.
Step 25.3), the parameters are updated according to the calculated gradients.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only illustrative of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A behavior identification method based on a SimAM attention mechanism is characterized by comprising the following steps:
sampling an input video and randomly extracting a sample feature map, the sample feature map comprising a single-frame first video image and a second video image of several consecutive frames, and scaling and cropping the sample feature map;
decomposing the first video image into its three primary color channels to obtain a three-primary-color channel image; computing a stacked optical flow for the second video image;
inputting the single-frame three-primary-color channel image into a spatial stream CNN to extract a spatial stream feature map, and inputting the stacked optical flow into a temporal stream CNN to extract a temporal stream feature map;
computing the SimAM attention of the spatial stream CNN and the temporal stream CNN, and fusing it with the corresponding spatial stream feature map and temporal stream feature map respectively to form the corresponding spatial stream attention-fused feature map and temporal stream attention-fused feature map;
for an input video containing more high-level features, inputting the dual-stream attention-fused feature map obtained by fusing the spatial stream attention-fused feature map and the temporal stream attention-fused feature map into a fully connected layer for classification, and inputting its output value into a softmax function to obtain the final behavior recognition result, wherein more high-level features are indicated by a single-frame input video resolution greater than 1080p;
for an input video containing more low-level features, inputting the spatial stream attention-fused feature map into a fully connected layer for classification in the spatial stream CNN, inputting the temporal stream attention-fused feature map into a fully connected layer for classification in the temporal stream CNN, fusing the two network output values by averaging, and inputting the result into a softmax function to obtain the final behavior recognition result.
2. The method of claim 1, wherein scaling and cropping the sample feature map comprises: scaling the sample feature map to N × N and then randomly cropping it to N′ × N′, where N is the pixel size of the image and N′ is the pixel size of the image after random cropping.
3. The behavior recognition method based on the SimAM attention mechanism as claimed in claim 2, wherein the three primary color channel image acquisition method comprises:
decomposing the scaled and cropped first video image into red, green and blue channels to obtain images X_z(x, y) under the three channels,
where z denotes the z-th channel, z is an integer with 1 ≤ z ≤ 3, and the 1st, 2nd and 3rd channels are the red, green and blue channels respectively;
x is the horizontal coordinate and y the vertical coordinate of a pixel in the image.
4. The method of claim 2, wherein the computing the optical flow of the stack comprises:
the optical flow is regarded as a set of displacement vector fields between consecutive frames t and t+1; for a point (u, v) in the t-th frame, the optical flow of the t-th frame is denoted I_τ and is computed as follows:
where u ∈ [1, w], v ∈ [1, h], k ∈ [1, L]; w is the width of the second video image, h is the height of the second video image, L is the number of frames of the second video image, and k indexes the frames; finally a stacked optical flow of size N′ × N′ × 2L is obtained.
5. The method of claim 2, wherein the step of extracting the spatial flow feature map or the temporal flow feature map comprises the steps of:
step 1: performing a padding operation on the input three-primary-color channel image or the stacked optical flow, expanding it from N′ × N′ to (N′+7) × (N′+7) and filling the expanded portion with the value 0; convolving the three-primary-color channel image or the stacked optical flow with 96 convolution kernels of size 7 × 7 and a stride of 2 to generate a feature map;
step 2: and (3) linearly rectifying the characteristic diagram generated in the step 1 by using a ReLU, wherein the formula of a ReLU function is as follows:
ReLU(m)=max(0,m)
wherein m is an independent variable;
step 3: performing a max pooling operation on the linearly rectified feature map, with a pooling size of 2 × 2;
step 4: convolving the feature map generated in step 3 with 256 convolution kernels of size 5 × 5, a stride of 2 and a pooling size of 2 × 2;
step 5: convolving the feature map generated in step 4 with 512 convolution kernels of size 3 × 3 and a stride of 1;
step 6: convolving the feature map generated in step 5 with 512 convolution kernels of size 3 × 3 and a stride of 1;
step 7: convolving the feature map generated in step 6 with 512 convolution kernels of size 3 × 3, a stride of 1 and a pooling size of 2 × 2;
the 512 spatial stream feature maps S_p obtained are regarded as 512 channels, where p ∈ [1, 512] and each S_p has (N′/32) × (N′/32) neurons;
the 512 temporal stream feature maps S_q obtained are regarded as 512 channels, where q ∈ [1, 512] and each S_q has (N′/32) × (N′/32) neurons.
6. The SimAM attention mechanism-based behavior recognition method of claim 5, wherein computing the SimAM attention of the spatial stream CNN and the temporal stream CNN and fusing it with the corresponding spatial stream feature map S_p and temporal stream feature map S_q to obtain the spatial stream attention-fused feature map S′_p and the temporal stream attention-fused feature map S′_q specifically comprises:
SimAM attention calculation for spatial streams CNN and temporal streams CNN:
calculating an energy function e_r for each neuron:
where r denotes the target neuron in a single input channel; q_i denotes the other neurons in that channel, with i a serial number; w_r is the weight of the linear transformation; b_r is the bias of the linear transformation; (N′/32) × (N′/32) is the number of other neurons on the channel; γ is a variable; λ is a coefficient;
calculating w_r:
calculating b_r:
calculating μ_r, which denotes the mean:
calculating σ_r², which denotes the variance:
each neuron corresponds to one neuron energy, and the energies of all neurons of a single channel form the energy matrix E of that channel; the attention weight matrix E′ of the channel is obtained by taking the reciprocal of each element of E and normalizing it with the sigmoid function, with the following calculation formula:
the fusing the SimAM attention of the spatial stream CNN and the temporal stream CNN with the spatial stream feature map and the temporal stream feature map respectively comprises:
the fused characteristic diagram S' has the calculation formula as follows:
S′=S·E′
where S denotes one of the spatial stream feature map S_p or the temporal stream feature map S_q, and S′ denotes the corresponding spatial stream attention-fused feature map S′_p or temporal stream attention-fused feature map S′_q.
7. The SimAM attention mechanism-based behavior recognition method of claim 6, wherein obtaining the dual-stream attention-fused feature map by fusing the spatial stream attention-fused feature map and the temporal stream attention-fused feature map comprises the following steps:
the spatial stream attention-fused feature map S′_p and the temporal stream attention-fused feature map S′_q are fused by concatenation (cascade fusion):
where the value at a position (i, j, d) of the feature map S′_p output by the spatial stream network (s denoting space) and the value at the corresponding position (i, j, d) of the feature map S′_q output by the temporal stream network (t denoting time) are interleaved at positions (i, j, 2d−1) and (i, j, 2d) of the concatenated feature map, d denoting the d-th feature map;
a concatenated fusion feature map of size (N′/32) × (N′/32) × 1024 is obtained; the concatenated fusion feature map is then passed through three successive convolution layers with 512 kernels of size 3 × 3, 1024 kernels of size 3 × 3 and 512 kernels of size 1 × 1 respectively, giving the dual-stream attention-fused feature map S_d of size (N′/32) × (N′/32) × 512.
8. The method of claim 7, wherein the SimAM attention mechanism-based behavior recognition method,
the full connectivity layer is classified as:
output = f(w^T A + b)
where w is the weight vector, T denotes transposition, A is the input vector, b is the bias vector, and the output dimension is H × 1; the input vector A is one of the dual-stream attention-fused feature map S_d, the temporal stream attention-fused feature map S′_q, or the spatial stream attention-fused feature map S′_p;
the output is correspondingly one of the spatial stream network model output s_output, the temporal stream network model output t_output, or the dual-stream network model output d_output.
9. The method of claim 8, wherein the SimAM attention mechanism-based behavior recognition method,
solving the final behavior recognition result specifically includes:
the softmax function expression is:
where output_i denotes the i-th element of output, output_k denotes the k-th element of output, P denotes probability, exp() denotes the exponential function with the natural constant e as base, and H is the number of elements;
the input of the softmax is d_output, or the average of s_output and t_output;
and the behavior classification label corresponding to the element with the highest probability in the H elements is the final behavior identification result.
10. The method of claim 1, wherein the spatial stream CNN and the temporal stream CNN are trained with a stochastic gradient descent algorithm;
the gradient of the output layer is calculated; the gradient of the k-th node of the output layer (layer K) is computed by the following formula:
where o_k denotes the output of the k-th node of layer K and t_k denotes the label of the k-th node of layer K;
the gradient of the hidden layer is calculated; the gradient of the i-th node of hidden layer I is computed by the following formula:
where o_i denotes the output value of the i-th node of hidden layer I, the gradient term refers to the i-th node of the layer above hidden layer I (layer J), and w_ab denotes the element in row a and column b of the weight matrix; the parameters are updated according to the calculated gradient.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111572410.XA CN114387539A (en) | 2021-12-21 | 2021-12-21 | Behavior identification method based on SimAM attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111572410.XA CN114387539A (en) | 2021-12-21 | 2021-12-21 | Behavior identification method based on SimAM attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114387539A true CN114387539A (en) | 2022-04-22 |
Family
ID=81197576
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111572410.XA Pending CN114387539A (en) | 2021-12-21 | 2021-12-21 | Behavior identification method based on SimAM attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114387539A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114886436A (en) * | 2022-05-10 | 2022-08-12 | 广西师范大学 | Premature beat identification method based on improved convolutional neural network |
- 2021-12-21: CN application CN202111572410.XA, published as CN114387539A, status Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111091045B (en) | Sign language identification method based on space-time attention mechanism | |
CN111144329B (en) | Multi-label-based lightweight rapid crowd counting method | |
CN112347888B (en) | Remote sensing image scene classification method based on bi-directional feature iterative fusion | |
CN110659727A (en) | Sketch-based image generation method | |
CN112836646B (en) | Video pedestrian re-identification method based on channel attention mechanism and application | |
CN113420838B (en) | SAR and optical image classification method based on multi-scale attention feature fusion | |
CN114004847B (en) | Medical image segmentation method based on graph reversible neural network | |
Bai et al. | A lightweight and multiscale network for remote sensing image scene classification | |
WO2022227292A1 (en) | Action recognition method | |
CN114419732A (en) | HRNet human body posture identification method based on attention mechanism optimization | |
CN113420179B (en) | Semantic reconstruction video description method based on time sequence Gaussian mixture hole convolution | |
CN114780767A (en) | Large-scale image retrieval method and system based on deep convolutional neural network | |
CN113807318A (en) | Action identification method based on double-current convolutional neural network and bidirectional GRU | |
CN114463235A (en) | Infrared and visible light image fusion method and device and storage medium | |
CN114387539A (en) | Behavior identification method based on SimAM attention mechanism | |
CN115272670A (en) | SAR image ship instance segmentation method based on mask attention interaction | |
CN112528077B (en) | Video face retrieval method and system based on video embedding | |
Zhang et al. | A scale adaptive network for crowd counting | |
CN117132885A (en) | Hyperspectral image classification method, hyperspectral image classification system and storage medium | |
CN113780305B (en) | Significance target detection method based on interaction of two clues | |
Cheng et al. | Exploit the potential of multi-column architecture for crowd counting | |
CN116469172A (en) | Bone behavior recognition video frame extraction method and system under multiple time scales | |
CN114863174A (en) | Small sample classification algorithm based on multi-scale attention feature fusion | |
Wang | Micro-expression Recognition Based on Multi-Scale Attention Fusion | |
CN114627370A (en) | Hyperspectral image classification method based on TRANSFORMER feature fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |