CN114387539A - Behavior identification method based on SimAM attention mechanism - Google Patents
- Publication number
- CN114387539A (application number CN202111572410.XA)
- Authority
- CN
- China
- Prior art keywords
- feature map
- attention
- flow
- output
- channel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Probability & Statistics with Applications (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a behavior recognition method based on the SimAM attention mechanism. The method makes full use of the parameter-free nature of the SimAM attention module and combines it organically with a two-stream network model, so that recognition accuracy is improved while network parameters and training difficulty are reduced. Different fusion modes are adopted adaptively according to the high-level and low-level feature content (i.e., the resolution) of the input video, which improves the perception of detail and yields a high-performance behavior recognition algorithm. The two-stream network model is thereby optimized more effectively, improving both the engineering practicality and the accuracy of the behavior recognition algorithm of this scheme.
Description
Technical Field
The invention belongs to a computer vision technology in the field of artificial intelligence, and particularly relates to a behavior identification method based on a SimAM attention mechanism.
Background Art
In the era of internet big data, more and more videos are being shared, and methods for quickly extracting information from massive video resources have extremely high research and application value. Human behavior recognition in video has gradually become a major research hotspot in computer vision and is widely applied in public video surveillance, human-computer interaction, scientific cognition, medical rehabilitation and other fields. In recent years, with the increasing level of computing power, deep learning has developed rapidly, and behavior recognition algorithms based on deep learning are gradually emerging.
At present, behavior recognition algorithms based on deep learning are mainly classified into three categories:
(1) Single-stream network model methods. The three-dimensional convolutional neural network (3D CNN) proposed by Shuiwang Ji et al. is widely used in this class of methods. Its network structure comprises a hardwired layer, three-dimensional convolutional layers, two pooling layers and a fully connected layer.
(2) Two-stream network model methods. Inspired by the two-streams hypothesis of neuroscience, Simonyan et al. first proposed the two-stream network model in 2014. The two-streams hypothesis holds that the visual cortex contains a ventral pathway, which responds to the shape and color of a target, and a dorsal pathway, which responds to the spatial displacement caused by the target's motion. The two-stream architecture mimics the visual cortex by establishing a temporal information channel and a spatial information channel: two mutually independent, parallel CNNs extract the temporal and spatial features of the video, and the features are finally fused.
(3) Multi-stream network model methods. These extend the two-stream model by adding further CNNs on top of it to extract additional kinds of features.
Throughout the development of deep learning, significant breakthroughs and discoveries have often been grounded in neuroscience theory. In visual neuroscience, the most informative neurons are usually those whose firing patterns differ from those of the surrounding neurons. Moreover, an active neuron may also suppress the activity of surrounding neurons, a phenomenon known as spatial suppression. In visual processing, neurons that exhibit significant spatial suppression should be given higher priority, and the simplest way to find them is to measure the linear separability between a target neuron and the other neurons. Based on this spatial suppression effect, Lingxiao Yang et al. recently proposed SimAM (A Simple, Parameter-Free Attention Module for Convolutional Neural Networks), which forms the attention of each neuron by computing an energy function that measures the neuron's importance.
The two-stream network model mentioned above has good generalization and extensibility, but it has since been found that fusing the recognition results of the temporal stream and the spatial stream by late fusion gives the model poor perception of detail. Many improved models have therefore been proposed on this basis, for example adding an attention module to the two-stream network to form a "two-stream network model + attention mechanism" structure; existing attention mechanisms include channel attention, spatial attention and so on. Combining the two-stream model with an attention mechanism effectively improves recognition accuracy, but it also greatly increases the complexity and training difficulty of the model, which limits its further application.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, on the basis of existing artificial intelligence and machine vision technology, a behavior identification method based on the SimAM attention mechanism that addresses the shortcomings of the existing behavior recognition algorithms described in the background.
The principle is as follows:
Video features are extracted with the existing two-stream network model and further strengthened by combining it with the lightweight SimAM attention mechanism; different feature fusion modes are adopted according to the high-level and low-level feature content of the video, which improves recognition accuracy, reduces network parameters and training difficulty, and improves the perception of detail.
A behavior identification method based on a SimAM attention mechanism comprises the following steps:
sampling an input video and randomly extracting a sample feature map, the sample feature map comprising a single-frame first video image and a second video image of several consecutive frames, and scaling and cropping the sample feature map;
spatial stream feature extraction and temporal stream feature extraction: decomposing the first video image into its three primary color channels to obtain a three-primary-color channel image, and computing a stacked optical flow for the second video image; inputting the single-frame three-primary-color channel image into the spatial stream CNN to extract a spatial stream feature map, and inputting the stacked optical flow into the temporal stream CNN to extract a temporal stream feature map;
computing the SimAM attention of the spatial stream CNN and the temporal stream CNN respectively, and fusing it with the corresponding spatial stream feature map and temporal stream feature map to obtain a spatial stream attention-fused feature map and a temporal stream attention-fused feature map;
early fusion and output of features: for an input video containing more high-level features (single-frame input video resolution greater than 1080p), the features are fused in an early fusion mode: the fused feature of the spatial stream attention-fused feature map and the temporal stream attention-fused feature map is input into a fully connected layer for classification, and the result is output through a softmax function;
late fusion and output of results: for an input video containing more low-level features (single-frame input video resolution less than 1080p), the features are fused in a late fusion mode: in the spatial stream CNN the spatial stream attention-fused feature map is input into a fully connected layer for classification, in the temporal stream CNN the temporal stream attention-fused feature map is input into a fully connected layer for classification, the two output values are fused by averaging, and the averaged value is finally input into a softmax function to output the result.
Low-level features refer to features such as edges, contours, positions and sizes; high-level features refer to features such as textures. In general, the higher the resolution, the more high-level features the video contains. A minimal sketch of this resolution-based fusion switch is given below.
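The following Python sketch illustrates one possible realization of the switch between early and late fusion described above. It is a minimal sketch under stated assumptions: the names classify_behavior, dual_stream_logits, spatial_logits and temporal_logits are illustrative and not taken from the patent, and the 1080p criterion is interpreted here as a check on the frame height.

```python
# Hedged sketch of the resolution-based fusion switch; all names are illustrative.
import torch
import torch.nn.functional as F

def classify_behavior(frame_height: int,
                      dual_stream_logits: torch.Tensor,
                      spatial_logits: torch.Tensor,
                      temporal_logits: torch.Tensor) -> torch.Tensor:
    """Return class probabilities using early or late fusion depending on resolution."""
    if frame_height > 1080:
        # More high-level features: early fusion, classify the fused dual-stream feature.
        return F.softmax(dual_stream_logits, dim=-1)
    # More low-level features: late fusion, average the two stream outputs before softmax.
    return F.softmax((spatial_logits + temporal_logits) / 2.0, dim=-1)
```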
Further, the specific steps of scaling and cropping the sample feature map are: scale the sample feature map to N × N and then randomly crop it to N′ × N′, where N is the pixel size of the image and N′ is the pixel size of the image after random cropping.
Further, the three primary color channel image acquisition method specifically includes the following steps:
decomposing the scaled and cropped first video image into red, green and blue channels to obtain images X_z(x, y) under the three channels,
where z denotes the z-th channel, z is an integer with 1 ≤ z ≤ 3, and the 1st, 2nd and 3rd channels are the red, green and blue channels respectively;
x is the horizontal coordinate and y the vertical coordinate of a pixel in the image.
Further, computing the stacked optical flow specifically comprises the following steps:
the optical flow is regarded as a set of displacement vector fields between consecutive frames t and t+1; for a point (u, v) in the t-th frame, the optical flow of the t-th frame is denoted I_τ and is computed as follows:
where u ∈ [1, w], v ∈ [1, h], k ∈ [1, L]; w is the width of the second video image, h is the height of the second video image, and L is the number of frames of the second video image;
finally, a stacked optical flow of size N′ × N′ × 2L is obtained.
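A reconstruction of the stacked optical flow referenced above, following the standard two-stream formulation of stacked displacement fields (Simonyan & Zisserman) with the symbols defined in this description — given as an illustration rather than the exact expression of the original filing — is:

$$ I_\tau(u, v, 2k-1) = d^{x}_{\tau+k-1}(u, v), \qquad I_\tau(u, v, 2k) = d^{y}_{\tau+k-1}(u, v), $$
$$ u \in [1, w], \quad v \in [1, h], \quad k \in [1, L], $$

where d^x_t and d^y_t denote the horizontal and vertical displacement fields between frames t and t+1; stacking the 2L channels and cropping to N′ × N′ gives the N′ × N′ × 2L input.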
Further, the extracting of the spatial flow feature map or the temporal flow feature map specifically includes the following steps:
step 1: perform a padding operation on the input three-primary-color channel image or the stacked optical flow, expanding it from N′ × N′ to (N′+7) × (N′+7) and filling the expanded portion with the value 0; convolve the three-primary-color channel image or the stacked optical flow with 96 convolution kernels of size 7 × 7 and a stride of 2 to generate a feature map;
step 2: and (3) linearly rectifying the characteristic diagram generated in the step 1 by using a ReLU, wherein the formula of a ReLU function is as follows:
ReLU(m)=max(0,m)
wherein m is an independent variable; max is the maximum of 0 and m;
step 3: perform a max pooling operation on the linearly rectified feature map, with a pooling size of 2 × 2;
step 4: convolve the feature map generated in step 3 with 256 convolution kernels of size 5 × 5, a stride of 2 and a pooling size of 2 × 2;
step 5: convolve the feature map generated in step 4 with 512 convolution kernels of size 3 × 3 and a stride of 1;
step 6: convolve the feature map generated in step 5 with 512 convolution kernels of size 3 × 3 and a stride of 1;
step 7: convolve the feature map generated in step 6 with 512 convolution kernels of size 3 × 3, a stride of 1 and a pooling size of 2 × 2;
finally, 512 spatial stream feature maps S_p are obtained and regarded as 512 channels, where p ∈ [1, 512] and each S_p has (N′/32) × (N′/32) neurons;
similarly, the 512 temporal stream feature maps S_q are regarded as 512 channels, where q ∈ [1, 512] and each S_q has (N′/32) × (N′/32) neurons. A sketch of this convolution stack in code is given below.
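For illustration, the five-stage convolution stack of steps 1-7 can be sketched in PyTorch as follows. This is a sketch under the stated kernel counts and strides, with padding values chosen so that an N′ = 224 input yields 7 × 7 × 512 maps; the exact padding of the original zero-filling step is an assumption, not a definitive implementation.

```python
# Hedged PyTorch sketch of the five-stage convolution stack (steps 1-7 above).
import torch.nn as nn

def make_stream_cnn(in_channels: int) -> nn.Sequential:
    """in_channels = 3 for the spatial stream (RGB), 2L for the temporal stream."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 96, kernel_size=7, stride=2, padding=3),  # step 1: 96 kernels, 7x7, stride 2
        nn.ReLU(inplace=True),                                           # step 2: linear rectification
        nn.MaxPool2d(2),                                                 # step 3: 2x2 max pooling
        nn.Conv2d(96, 256, kernel_size=5, stride=2, padding=2),          # step 4: 256 kernels, 5x5, stride 2
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),                                                 # step 4: 2x2 pooling
        nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1),         # step 5: 512 kernels, 3x3
        nn.ReLU(inplace=True),
        nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1),         # step 6: 512 kernels, 3x3
        nn.ReLU(inplace=True),
        nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1),         # step 7: 512 kernels, 3x3
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),                                                 # step 7: 2x2 pooling
    )

# With a 224 x 224 input this produces 512 feature maps of size 7 x 7 (= 224/32).
```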
Further, computing the SimAM attention of the spatial stream CNN and the temporal stream CNN respectively, and fusing it with the corresponding spatial stream feature map S_p and temporal stream feature map S_q to obtain the spatial stream attention-fused feature map S′_p and the temporal stream attention-fused feature map S′_q, specifically comprises the following steps:
step 1: SimAM attention calculation for spatial streams CNN and temporal streams CNN:
calculating an energy function e_r for each neuron:
where r denotes the target neuron in a single input channel; q_i denotes the other neurons in that channel, with i a serial number; w_r is the weight of the linear transformation; b_r is the bias of the linear transformation; (N′/32) × (N′/32) is the number of other neurons on the channel; γ is a variable; λ is a coefficient;
w_r is calculated using the following formula:
b_r is calculated using the following formula:
μ_r denotes the mean, with the formula:
σ_r² denotes the variance, with the formula:
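A reconstruction of these expressions, following the published SimAM formulation (Yang et al., 2021) rewritten with the symbols of this description (target neuron r, other neurons q_i, M the number of neurons on the channel) — given as an illustration rather than the exact equations of the original filing — is:

$$ e_r = \frac{1}{M-1}\sum_{i=1}^{M-1}\bigl(-1 - (w_r q_i + b_r)\bigr)^2 + \bigl(1 - (w_r\, r + b_r)\bigr)^2 + \lambda w_r^2 $$
$$ w_r = -\frac{2\,(r - \mu_r)}{(r - \mu_r)^2 + 2\sigma_r^2 + 2\lambda}, \qquad b_r = -\tfrac{1}{2}\,(r + \mu_r)\, w_r $$
$$ \mu_r = \frac{1}{M-1}\sum_{i=1}^{M-1} q_i, \qquad \sigma_r^2 = \frac{1}{M-1}\sum_{i=1}^{M-1}\bigl(q_i - \mu_r\bigr)^2 $$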
The smaller the energy of a neuron, the more it differs from the other neurons in the same channel and the more important it is; the reciprocal of the minimal neuron energy is taken as the weight of that neuron.
Each neuron corresponds to one neuron energy, and the energies of all neurons of a single channel form the energy matrix E of that channel. The attention weight matrix E′ of the channel is obtained by taking the reciprocal of each element of E and normalizing it with the sigmoid function, calculated as follows:
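Under the same reconstruction, the minimal energy and the channel attention weight matrix take the form:

$$ e_r^{*} = \frac{4\,(\sigma_r^2 + \lambda)}{(r - \mu_r)^2 + 2\sigma_r^2 + 2\lambda}, \qquad E' = \mathrm{sigmoid}\!\left(\frac{1}{E}\right), $$

where the sigmoid is applied element-wise to the reciprocal of each entry of the energy matrix E.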
step 2: the method for fusing the SimAM attention of the spatial stream CNN and the temporal stream CNN with the spatial stream feature map and the temporal stream feature map respectively comprises the following steps:
the fused characteristic diagram S' has the calculation formula as follows:
S′=S·E′
where S denotes one of the spatial stream feature map S_p or the temporal stream feature map S_q, and S′ denotes the corresponding spatial stream attention-fused feature map S′_p or temporal stream attention-fused feature map S′_q. A minimal code sketch of this parameter-free SimAM attention is given below.
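For illustration, a parameter-free SimAM module can be sketched in PyTorch as follows, following the implementation published with the SimAM paper; the class name SimAM and the default value of λ are assumptions, not taken from the patent.

```python
# Hedged sketch of a parameter-free SimAM module (per-channel energy-based attention).
import torch
import torch.nn as nn

class SimAM(nn.Module):
    def __init__(self, lambda_: float = 1e-4):
        super().__init__()
        self.lambda_ = lambda_  # the coefficient λ above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, H, W); attention is computed independently per channel
        n = x.shape[2] * x.shape[3] - 1                      # number of "other" neurons
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)    # squared distance to the channel mean
        v = d.sum(dim=(2, 3), keepdim=True) / n              # channel variance estimate
        e_inv = d / (4 * (v + self.lambda_)) + 0.5           # reciprocal of the minimal energy
        return x * torch.sigmoid(e_inv)                      # S' = S · E'
```

Applied to the 512-channel maps S_p and S_q, such a module would produce the attention-fused maps S′_p and S′_q without adding any learnable parameters.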
Further, obtaining the dual-stream attention-fused feature map by fusing the spatial stream attention-fused feature map and the temporal stream attention-fused feature map specifically comprises:
the spatial stream attention-fused feature map S′_p and the temporal stream attention-fused feature map S′_q are fused by concatenation (cascade fusion),
where the value at a position (i, j, d) of the feature map S′_p output by the spatial stream network (s denoting space) and the value at the corresponding position (i, j, d) of the feature map S′_q output by the temporal stream network (t denoting time) are interleaved at positions (i, j, 2d−1) and (i, j, 2d) of the concatenated feature map, d denoting the d-th feature map;
this gives a concatenated fusion feature map of size (N′/32) × (N′/32) × 1024, as written out below.
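A reconstruction of the concatenation, written in a common notation (x^s and x^t for the spatial and temporal attention-fused maps, y^cat for the concatenated map; the exact ordering of the two interleaved channels in the original filing may differ), is:

$$ y^{cat}_{i,j,2d-1} = x^{s}_{i,j,d}, \qquad y^{cat}_{i,j,2d} = x^{t}_{i,j,d}, \qquad d \in [1, 512], $$

so that the 512 spatial channels and the 512 temporal channels interleave into a map with 1024 channels.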
The concatenated fusion feature map is then passed through three successive convolution layers with 512 kernels of size 3 × 3, 1024 kernels of size 3 × 3 and 512 kernels of size 1 × 1 respectively; the last convolution layer reduces the dimensionality, finally giving the dual-stream attention-fused feature map S_d of size (N′/32) × (N′/32) × 512.
Further, the fully connected layer classification is:
output = f(w^T A + b)
where w is the weight vector, T denotes transposition, A is the input vector, b is the bias vector, and the output dimension is H × 1;
the input vector A is one of the dual-stream attention-fused feature map S_d, the temporal stream attention-fused feature map S′_q, or the spatial stream attention-fused feature map S′_p;
the obtained output is correspondingly one of the spatial stream network model output s_output, the temporal stream network model output t_output, or the dual-stream network model output d_output.
Further, solving the final behavior recognition result specifically comprises:
the softmax function expression is:
where output_i denotes the i-th element of output, output_k denotes the k-th element of output, P denotes probability, exp() denotes the exponential function with the natural constant e as base, and H is the number of elements;
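The standard softmax expression, written with the symbols defined above, is:

$$ P(\mathrm{output}_i) = \frac{\exp(\mathrm{output}_i)}{\sum_{k=1}^{H} \exp(\mathrm{output}_k)} $$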
the input of the softmax is d_output, or the average of s_output and t_output;
and the behavior classification label corresponding to the element with the highest probability in the H elements is the final behavior identification result.
Further, the spatial stream CNN and the temporal stream CNN are trained with a stochastic gradient descent algorithm;
the gradient of the output layer is calculated; the gradient of the k-th node of the output layer (layer K) is computed by the following formula:
where o_k denotes the output of the k-th node of layer K, and t_k denotes the label of the k-th node of layer K;
the gradient of the hidden layer is calculated; the gradient of the i-th node of hidden layer I is computed by the following formula:
where o_i denotes the output value of the i-th node of hidden layer I, the gradient term refers to the i-th node of the layer above hidden layer I (layer J), and w_ab denotes the element in row a and column b of the weight matrix; the parameters are updated according to the calculated gradients.
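The gradient expressions follow standard backpropagation; one common form, assuming sigmoid activations and a squared-error loss (an assumption — the original filing may use a different loss or activation), is:

$$ \delta^{K}_{k} = (o_k - t_k)\, o_k\, (1 - o_k), \qquad \delta^{I}_{i} = o_i\, (1 - o_i) \sum_{b} w_{ib}\, \delta^{J}_{b}, $$

where δ denotes the error term (gradient) of a node; the weights are then updated as w ← w − η·δ·o with learning rate η.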
The main innovation points of the invention are as follows:
(1) the two-stream network model is combined with the lightweight SimAM attention mechanism;
(2) different feature fusion modes are adopted according to the high-level and low-level feature content of the video.
Compared with the prior art, the invention has the following advantages: the behavior recognition method based on the SimAM attention mechanism uses a CNN-M-1024 network in the spatial stream and the temporal stream respectively, which fully extracts the spatial and temporal information of the video; fused with the SimAM attention mechanism, it improves the accuracy of video behavior recognition without increasing the number of network parameters, achieving a balance between model accuracy and training complexity. The method adaptively adopts different fusion modes according to the high-level and low-level feature content of the input video, further improving the accuracy of the model.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention using an early fusion approach;
fig. 2 is a flowchart of an embodiment of the present invention using a late fusion mode.
Detailed Description
The technical scheme of the invention is explained in further detail below with reference to the accompanying drawings:
the present invention may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. In the drawings, components are exaggerated for clarity.
The detailed steps of the present invention are given below:
the behavior identification method based on the SimAM attention mechanism comprises the following steps:
pretreatment part (step 1 to step 2):
step 1), the input video is sampled using the public UCF-101 dataset: first, one frame is randomly extracted and denoted img_a, where a is a group label; then ten consecutive frames are randomly extracted and denoted img_bi, where i ∈ [1, 10] and b is a group label;
step 2), the eleven sampled frames are scaled and cropped in the same way: each image is first scaled to 256 × 256 and then randomly cropped to 224 × 224;
spatial stream feature extraction section (step 3 to step 10):
step 3), the input image img_a is decomposed into red, green and blue channels to obtain images X_z(x, y) under the three channels, where z denotes the z-th channel, z is an integer with 1 ≤ z ≤ 3, and the 1st, 2nd and 3rd channels are the red, green and blue channels respectively; x and y are the horizontal and vertical coordinates of a pixel in the image;
step 4), a padding operation is performed on X_z(x, y), expanding the image from 224 × 224 to 231 × 231, with the expanded portion filled with the value 0; each X_z is then convolved with 96 convolution kernels of size 7 × 7 and a stride of 2, the j-th convolution kernel being denoted conv1_j(m, n), where j ∈ [1, 96] and m, n ∈ [0, 6] are the row and column indices of an element of the kernel. The convolution process can be expressed by the following steps:
step 4.1), the convolution of a single channel is calculated, which can be expressed by the following formula:
where x ∈ [0, 112], y ∈ [0, 112], and the operation multiplies corresponding elements of the matrices;
step 4.2), the sum of the convolutions of the three channels is calculated, which can be expressed by the following formula:
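A reconstruction of the formulas of steps 4.1 and 4.2, using the symbols defined above (the per-channel result is written F_j^z here as an illustrative name, not a name from the original filing), is:

$$ F_j^{z}(x, y) = \sum_{m=0}^{6}\sum_{n=0}^{6} X_z(2x + m,\; 2y + n)\cdot \mathrm{conv1}_j(m, n) $$
$$ F_j(x, y) = \sum_{z=1}^{3} F_j^{z}(x, y) $$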
step 5), the generated feature map F_j(x, y) is linearly rectified, which can be expressed by the following formula:
F_j′(x, y) = ReLU(F_j(x, y))
wherein the formula for ReLU is:
ReLU(x)=max(0,x)
step 6), the linearly rectified feature map F_j′(x, y) is max-pooled with a pooling size of 2 × 2, finally generating a 56 × 56 feature map F_j″(x, y); the process can be expressed as follows:
F_j″(x, y) = max(F_j′(2x, 2y), F_j′(2x+1, 2y), F_j′(2x, 2y+1), F_j′(2x+1, 2y+1))
where x ∈ [0, 56], y ∈ [0, 56];
step 7), the feature map is convolved with 256 kernels of size 5 × 5, a stride of 2 and a pooling size of 2 × 2, giving a 14 × 14 × 256 feature map; the convolution process is the same as steps 4 to 6;
step 8), the feature map is convolved with 512 kernels of size 3 × 3 and a stride of 1, giving a 14 × 14 × 512 feature map; the convolution process is the same as step 4;
step 9), the feature map is convolved with 512 kernels of size 3 × 3 and a stride of 1, giving a 14 × 14 × 512 feature map; the convolution process is the same as step 4;
step 10), the feature map is convolved with 512 kernels of size 3 × 3, a stride of 1 and a pooling size of 2 × 2, finally giving 512 feature maps of size 7 × 7; the convolution process is the same as steps 4 to 6.
Spatial flow SimAM part (step 11 to step 13):
step 11), the 512 spatial stream feature maps of the spatial stream network obtained in step 10 are regarded as 512 channels and denoted S_p, where p ∈ [1, 512]; each S_p has 7 × 7 = 49 neurons; the energy function e_r of each neuron is calculated by the following formula:
where r denotes the target neuron in a single input channel; q_i denotes the other neurons in that channel, with i a serial number; w_r is the weight of the linear transformation; b_r is the bias of the linear transformation; 49 is the number of other neurons on the channel; γ is a variable; λ is a coefficient.
w_r is calculated using the following formula:
b_r is calculated using the following formula:
μ_r denotes the mean, with the formula:
σ_r² denotes the variance, with the formula:
The smaller the energy of a neuron, the more it differs from the other neurons in the same channel and the more important it is, so the reciprocal of the minimal neuron energy is used to represent the weight of the neuron.
Step 12), the energies of all neurons of a single channel form the energy matrix E of that channel; the attention weight matrix E′ of the channel is obtained by taking the reciprocal of each element of E and normalizing it with the sigmoid function, calculated by the following formula:
step 13), the feature map is fused with the channel attention weights to compute the fused feature map S′_p, with the formula:
S′_p = S_p · E′,  p ∈ [1, 512]
time stream feature extraction section (step 14 to step 15):
step 14), the stacked optical flow is calculated; dense optical flow can be viewed as a set of displacement vector fields between consecutive frames t and t+1; for a point (u, v) in the t-th frame, the optical flow of the t-th frame is denoted I_τ and is computed as follows:
where u ∈ [1, w], v ∈ [1, h], k ∈ [1, L]; w is the width of the second video image, h is the height of the second video image, and L is the number of frames of the second video image.
The number of frames L is taken as 10, which finally gives a stacked optical flow input of size 224 × 224 × 20.
Step 15), the stacked optical flow is convolved five times; the convolution steps are the same as steps 4 to 10;
time-flow SimAM part (step 16 to step 17):
step 16), the attention in the temporal stream network model is calculated, the same as steps 11 to 12;
step 17), the feature map is fused with the channel attention weights, the same as step 13;
a feature fusion and output part (step 18 to step 23):
if the resolution of the input video in any dimension is greater than 1080p, steps 18 to 20 are adopted;
feature early fusion part (step 18 to step 19):
step 18), the value at a position (i, j, d) of the feature map output by the spatial stream network (s denoting space) and the value at the corresponding position (i, j, d) of the feature map output by the temporal stream network (t denoting time) are concatenated (cat denoting the cascade), which is calculated as follows:
the feature map obtained by cascade fusion has a size of 7 × 7 × 1024;
step 19), the concatenated feature map is passed through three successive convolution layers with 512 kernels of size 3 × 3, 1024 kernels of size 3 × 3 and 512 kernels of size 1 × 1 respectively; the last convolution layer reduces the dimensionality, and the final output feature map has a size of 7 × 7 × 512, consistent with the size of the original feature map; the convolution process is the same as steps 4 to 6;
full connection (step 20):
step 20), the 7 × 7 × 512 feature map with fused SimAM attention is fed into two fully connected layers containing 1024 and 101 neurons respectively to obtain the output d_output of the dual-stream network model, where d denotes the temporal-spatial dual stream; the calculation formula is
output = f(w^T A + b)
where w is the weight vector, T denotes transposition, A is the input vector, b is the bias vector, and the output dimension is 101 × 1.
The fused output value d_output is then fed into a softmax function to obtain the final classification result;
if the resolution of the input video in all dimensions is less than 1080p, steps 21 to 23 are adopted;
late fusion of results (step 21 to step 23):
step 21), the 7 × 7 × 512 spatial stream feature map with fused SimAM attention is fed into two fully connected layers containing 1024 and 101 neurons respectively to obtain the output s_output of the spatial stream network model, where s denotes space; same as step 20;
step 22), the 7 × 7 × 512 temporal stream feature map with fused SimAM attention is fed into two fully connected layers containing 1024 and 101 neurons respectively to obtain the output t_output of the temporal stream network model, where t denotes time; same as step 20;
step 23), the outputs of the spatial stream network model and the temporal stream network model are fused by averaging, (s_output + t_output)/2, to obtain the output value of the dual-stream network model, which is fed into the softmax function to obtain the final behavior recognition result.
Output section (step 24)
Step 24), the expression of the softmax function is:
where output_i denotes the i-th element of output, output_k denotes the k-th element of output, P denotes probability, and exp() denotes the exponential function with the natural constant e as base; the behavior classification label corresponding to the element with the highest probability among the 101 elements is the final behavior recognition result.
The neural network training part:
step 25), training a neural network by using a stochastic gradient descent algorithm (SGD);
step 25.1), the gradient of the output layer is calculated; the gradient of the k-th node of the output layer (layer K) is computed by the following formula:
where o_k denotes the output of the k-th node of layer K and t_k denotes the label of the k-th node of layer K.
Step 25.2), the gradient of the hidden layer is calculated; the gradient of the i-th node of hidden layer I is computed by the following formula:
where o_i denotes the output value of the i-th node of hidden layer I, the gradient term refers to the i-th node of the layer above hidden layer I (layer J), and w_ab denotes the element in row a and column b of the weight matrix.
Step 25.3), the parameters are updated according to the calculated gradients.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only illustrative of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A behavior identification method based on a SimAM attention mechanism is characterized by comprising the following steps:
sampling an input video and randomly extracting a sample feature map, the sample feature map comprising a single-frame first video image and a second video image of several consecutive frames, and scaling and cropping the sample feature map;
decomposing the first video image into its three primary color channels to obtain a three-primary-color channel image; computing a stacked optical flow for the second video image;
inputting the single-frame three-primary-color channel image into a spatial stream CNN to extract a spatial stream feature map, and inputting the stacked optical flow into a temporal stream CNN to extract a temporal stream feature map;
computing the SimAM attention of the spatial stream CNN and the temporal stream CNN, and fusing it with the corresponding spatial stream feature map and temporal stream feature map respectively to form the corresponding spatial stream attention-fused feature map and temporal stream attention-fused feature map;
for an input video containing more high-level features, inputting the dual-stream attention-fused feature map obtained by fusing the spatial stream attention-fused feature map and the temporal stream attention-fused feature map into a fully connected layer for classification, and inputting its output value into a softmax function to obtain the final behavior recognition result, wherein more high-level features are indicated by a single-frame input video resolution greater than 1080p;
for an input video containing more low-level features, inputting the spatial stream attention-fused feature map into a fully connected layer for classification in the spatial stream CNN, inputting the temporal stream attention-fused feature map into a fully connected layer for classification in the temporal stream CNN, fusing the two network output values by averaging, and inputting the result into a softmax function to obtain the final behavior recognition result.
2. The method of claim 1, wherein scaling and cropping the sample feature map comprises: scaling the sample feature map to N × N and then randomly cropping it to N′ × N′, where N is the pixel size of the image and N′ is the pixel size of the image after random cropping.
3. The behavior recognition method based on the SimAM attention mechanism as claimed in claim 2, wherein the three primary color channel image acquisition method comprises:
decomposing the scaled and cropped first video image into red, green and blue channels to obtain images X_z(x, y) under the three channels,
where z denotes the z-th channel, z is an integer with 1 ≤ z ≤ 3, and the 1st, 2nd and 3rd channels are the red, green and blue channels respectively;
x is the horizontal coordinate and y the vertical coordinate of a pixel in the image.
4. The method of claim 2, wherein the computing the optical flow of the stack comprises:
the optical flow is regarded as a set of displacement vector fields between consecutive frames t and t+1; for a point (u, v) in the t-th frame, the optical flow of the t-th frame is denoted I_τ and is computed as follows:
where u ∈ [1, w], v ∈ [1, h], k ∈ [1, L]; w is the width of the second video image, h is the height of the second video image, L is the number of frames of the second video image, and k indexes the frames; finally a stacked optical flow of size N′ × N′ × 2L is obtained.
5. The method of claim 2, wherein the step of extracting the spatial flow feature map or the temporal flow feature map comprises the steps of:
step 1: performing a padding operation on the input three-primary-color channel image or the stacked optical flow, expanding it from N′ × N′ to (N′+7) × (N′+7) and filling the expanded portion with the value 0; convolving the three-primary-color channel image or the stacked optical flow with 96 convolution kernels of size 7 × 7 and a stride of 2 to generate a feature map;
step 2: and (3) linearly rectifying the characteristic diagram generated in the step 1 by using a ReLU, wherein the formula of a ReLU function is as follows:
ReLU(m)=max(0,m)
wherein m is an independent variable;
step 3: performing a max pooling operation on the linearly rectified feature map, with a pooling size of 2 × 2;
step 4: convolving the feature map generated in step 3 with 256 convolution kernels of size 5 × 5, a stride of 2 and a pooling size of 2 × 2;
step 5: convolving the feature map generated in step 4 with 512 convolution kernels of size 3 × 3 and a stride of 1;
step 6: convolving the feature map generated in step 5 with 512 convolution kernels of size 3 × 3 and a stride of 1;
step 7: convolving the feature map generated in step 6 with 512 convolution kernels of size 3 × 3, a stride of 1 and a pooling size of 2 × 2;
the 512 spatial stream feature maps S_p obtained are regarded as 512 channels, where p ∈ [1, 512] and each S_p has (N′/32) × (N′/32) neurons;
the 512 temporal stream feature maps S_q obtained are regarded as 512 channels, where q ∈ [1, 512] and each S_q has (N′/32) × (N′/32) neurons.
6. The SimAM attention mechanism-based behavior recognition method of claim 5, wherein computing the SimAM attention of the spatial stream CNN and the temporal stream CNN and fusing it with the corresponding spatial stream feature map S_p and temporal stream feature map S_q to obtain the spatial stream attention-fused feature map S′_p and the temporal stream attention-fused feature map S′_q specifically comprises:
SimAM attention calculation for spatial streams CNN and temporal streams CNN:
calculating an energy function e_r for each neuron:
where r denotes the target neuron in a single input channel; q_i denotes the other neurons in that channel, with i a serial number; w_r is the weight of the linear transformation; b_r is the bias of the linear transformation; (N′/32) × (N′/32) is the number of other neurons on the channel; γ is a variable; λ is a coefficient;
calculating w_r:
calculating b_r:
calculating μ_r, which denotes the mean:
calculating σ_r², which denotes the variance:
each neuron corresponds to one neuron energy, and the energies of all neurons of a single channel form the energy matrix E of that channel; the attention weight matrix E′ of the channel is obtained by taking the reciprocal of each element of E and normalizing it with the sigmoid function, with the following calculation formula:
the fusing the SimAM attention of the spatial stream CNN and the temporal stream CNN with the spatial stream feature map and the temporal stream feature map respectively comprises:
the fused characteristic diagram S' has the calculation formula as follows:
S′=S·E′
where S denotes one of the spatial stream feature map S_p or the temporal stream feature map S_q, and S′ denotes the corresponding spatial stream attention-fused feature map S′_p or temporal stream attention-fused feature map S′_q.
7. The SimAM attention mechanism-based behavior recognition method of claim 6, wherein obtaining the dual-stream attention-fused feature map by fusing the spatial stream attention-fused feature map and the temporal stream attention-fused feature map comprises the following steps:
the spatial stream attention-fused feature map S′_p and the temporal stream attention-fused feature map S′_q are fused by concatenation (cascade fusion):
where the value at a position (i, j, d) of the feature map S′_p output by the spatial stream network (s denoting space) and the value at the corresponding position (i, j, d) of the feature map S′_q output by the temporal stream network (t denoting time) are interleaved at positions (i, j, 2d−1) and (i, j, 2d) of the concatenated feature map, d denoting the d-th feature map;
a concatenated fusion feature map of size (N′/32) × (N′/32) × 1024 is obtained; the concatenated fusion feature map is then passed through three successive convolution layers with 512 kernels of size 3 × 3, 1024 kernels of size 3 × 3 and 512 kernels of size 1 × 1 respectively, giving the dual-stream attention-fused feature map S_d of size (N′/32) × (N′/32) × 512.
8. The method of claim 7, wherein the SimAM attention mechanism-based behavior recognition method,
the full connectivity layer is classified as:
output = f(w^T A + b)
where w is the weight vector, T denotes transposition, A is the input vector, b is the bias vector, and the output dimension is H × 1; the input vector A is one of the dual-stream attention-fused feature map S_d, the temporal stream attention-fused feature map S′_q, or the spatial stream attention-fused feature map S′_p;
the output is correspondingly one of the spatial stream network model output s_output, the temporal stream network model output t_output, or the dual-stream network model output d_output.
9. The method of claim 8, wherein the SimAM attention mechanism-based behavior recognition method,
solving the final behavior recognition result specifically includes:
the softmax function expression is:
where output_i denotes the i-th element of output, output_k denotes the k-th element of output, P denotes probability, exp() denotes the exponential function with the natural constant e as base, and H is the number of elements;
the input of the softmax is d_output, or the average of s_output and t_output;
and the behavior classification label corresponding to the element with the highest probability in the H elements is the final behavior identification result.
10. The method of claim 1, wherein the spatial stream CNN and the temporal stream CNN are trained with a stochastic gradient descent algorithm;
the gradient of the output layer is calculated; the gradient of the k-th node of the output layer (layer K) is computed by the following formula:
where o_k denotes the output of the k-th node of layer K and t_k denotes the label of the k-th node of layer K;
the gradient of the hidden layer is calculated; the gradient of the i-th node of hidden layer I is computed by the following formula:
where o_i denotes the output value of the i-th node of hidden layer I, the gradient term refers to the i-th node of the layer above hidden layer I (layer J), and w_ab denotes the element in row a and column b of the weight matrix; the parameters are updated according to the calculated gradient.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111572410.XA CN114387539A (en) | 2021-12-21 | 2021-12-21 | Behavior identification method based on SimAM attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111572410.XA CN114387539A (en) | 2021-12-21 | 2021-12-21 | Behavior identification method based on SimAM attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114387539A true CN114387539A (en) | 2022-04-22 |
Family
ID=81197576
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111572410.XA Pending CN114387539A (en) | 2021-12-21 | 2021-12-21 | Behavior identification method based on SimAM attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114387539A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114886436A (en) * | 2022-05-10 | 2022-08-12 | 广西师范大学 | Premature beat identification method based on improved convolutional neural network |
- 2021-12-21: CN application CN202111572410.XA, published as CN114387539A, status Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111091045B (en) | Sign language identification method based on space-time attention mechanism | |
CN111144329B (en) | Multi-label-based lightweight rapid crowd counting method | |
CN112347888B (en) | Remote sensing image scene classification method based on bi-directional feature iterative fusion | |
CN110659727A (en) | Sketch-based image generation method | |
CN112836646B (en) | Video pedestrian re-identification method based on channel attention mechanism and application | |
CN113420838B (en) | SAR and optical image classification method based on multi-scale attention feature fusion | |
CN114004847B (en) | Medical image segmentation method based on graph reversible neural network | |
Bai et al. | A lightweight and multiscale network for remote sensing image scene classification | |
WO2022227292A1 (en) | Action recognition method | |
CN114419732A (en) | HRNet human body posture identification method based on attention mechanism optimization | |
CN113420179B (en) | Semantic reconstruction video description method based on time sequence Gaussian mixture hole convolution | |
CN114780767A (en) | Large-scale image retrieval method and system based on deep convolutional neural network | |
CN113807318A (en) | Action identification method based on double-current convolutional neural network and bidirectional GRU | |
CN114463235A (en) | Infrared and visible light image fusion method and device and storage medium | |
CN114387539A (en) | Behavior identification method based on SimAM attention mechanism | |
CN115272670A (en) | SAR image ship instance segmentation method based on mask attention interaction | |
CN112528077B (en) | Video face retrieval method and system based on video embedding | |
Zhang et al. | A scale adaptive network for crowd counting | |
CN117132885A (en) | Hyperspectral image classification method, hyperspectral image classification system and storage medium | |
CN113780305B (en) | Significance target detection method based on interaction of two clues | |
Cheng et al. | Exploit the potential of multi-column architecture for crowd counting | |
CN116469172A (en) | Bone behavior recognition video frame extraction method and system under multiple time scales | |
CN114863174A (en) | Small sample classification algorithm based on multi-scale attention feature fusion | |
Wang | Micro-expression Recognition Based on Multi-Scale Attention Fusion | |
CN114627370A (en) | Hyperspectral image classification method based on TRANSFORMER feature fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |