
CN115761888A - Tower crane operator abnormal behavior detection method based on NL-C3D model - Google Patents


Info

Publication number
CN115761888A
CN115761888A
Authority
CN
China
Prior art keywords
network
convolution
model
image
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211462437.8A
Other languages
Chinese (zh)
Inventor
邓珍荣
李志宏
蓝如师
杨睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202211462437.8A
Publication of CN115761888A

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a tower crane operator abnormal behavior detection method based on an NL-C3D model, which comprises the following steps: 1) collecting a monitoring-video data set of tower crane operators during the operation process; 2) dividing the video data into image frames by an algorithm, and then cropping the image frames to a uniform size; 3) fusing a non-local module into the C3D network to obtain an NL-C3D network model; 4) importing the image-frame data set of step 2) into the NL-C3D network model in the order training set, verification set and test set for training and checking, and then obtaining the final result with a softmax classifier. The method improves detection precision and provides finer-grained detection.

Description

Tower crane operator abnormal behavior detection method based on NL-C3D model
Technical Field
The invention belongs to the field of behavior identification in computer vision, and relates to an abnormal behavior identification and detection method, in particular to a tower crane operator abnormal behavior detection method based on an NL-C3D model.
Background
With the rapid development of video surveillance technology and its wide application across industries, the volume of surveillance-video data has grown rapidly, and abnormal behavior detection has become an important research task; in particular, it has become a research difficulty in the various fields involving security and protection.
Conventional abnormal behavior detection methods, such as those based on hand-crafted features, record characteristic motion patterns using low-level trajectory features, the Histogram of Oriented Flow (HOF), the Histogram of Oriented Gradients (HOG), and the like. They are, however, not recommended for the complex scenes of surveillance video, because hand-crafted features are insufficient to characterize the behavior. In addition, new methods based on deep learning keep emerging. Among methods based on recurrent neural networks are the high-precision analysis model proposed by Yeung et al., built on an RNN and then trained by reinforcement learning, and the long short-term memory network (LSTM) method proposed by Escorcia et al.; these are very inefficient at processing long videos, and the basic features they extract do not support joint training. There are also two-stage detection methods, which pre-select possible regions from the video and then classify these candidate regions. This type of method likewise suffers from time-consuming and inefficient region pre-selection, and the staged approach may settle on a locally optimal solution without guaranteeing a globally optimal one. When processing video, these networks focus mainly on analyzing the current frame and pay insufficient attention to the frames before and after it, even though those frames, as the context information of the video, are very important for the continuous movement of people in the video.
Disclosure of Invention
In view of the problems in the prior art, the invention aims to provide a tower crane operator abnormal behavior detection method based on an NL-C3D model. The method strengthens the context-modeling capability of the conventional convolutional neural network by optimizing the network structure and integrating global features, and performs better at identifying and detecting behaviors in video.
The technical scheme for realizing the purpose of the invention is as follows:
a tower crane operator abnormal behavior detection method based on an NL-C3D model comprises the following steps:
1) Collecting a monitoring video data set related to the operation process of tower crane operators, and dividing the video data set into a training set, a verification set and a test set;
2) Dividing the video data into image frames by an algorithm, then cropping the image frames to a uniform size and keeping a corresponding number of image-frame samples; specifically, the imported video frames are adjusted to shape [10,16,112,112,3], where 16 is the frame_length, meaning that each training sample is 16 frames; 112 is the crop_size of the images, meaning that the video frames are cropped to 112 × 112 pixels; and 3 is the number of input channels;
3) Fusing a non-local module in the C3D network to obtain an NL-C3D network model;
4) Importing the image-frame data set of step 2) into the NL-C3D network model in the order training set, verification set and test set for training and checking, and then obtaining the final result with a softmax classifier. The process of collecting the video data set in step 1) is as follows: an operation video is taken by a camera, with a resolution greater than 320 × 240 pixels and a frame rate greater than 25 frames per second; one frame is then extracted from the data set every four frames to obtain the frame images; for clips whose frame interval cannot give the network an input temporal width of sixteen frames, the sampling step is manually reduced until the requirement of at least sixteen frames is met; and the data set is divided into a training set, a verification set and a test set in the ratio 6:2:2.
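The frame-sampling rule of the collection step, in which one frame in four is kept and the step is reduced when a clip is too short to yield the sixteen-frame temporal width, can be sketched in pure Python (a minimal sketch; the function name and default values are illustrative assumptions, not from the patent):

```python
def sample_frame_indices(n_frames: int, step: int = 4, min_samples: int = 16) -> list[int]:
    """Keep every `step`-th frame; shrink the step until at least
    `min_samples` frames are obtained (mirrors the manual step reduction)."""
    while step > 1 and len(range(0, n_frames, step)) < min_samples:
        step -= 1
    return list(range(0, n_frames, step))
```

For a 400-frame clip this keeps every fourth frame (100 frames); for a 40-frame clip the step drops until at least sixteen frames are retained.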
The image frames are cropped in step 2) and a corresponding number of image-frame samples are kept, specifically as follows: during input processing, in order to enhance the safety and precision of the model, the image frames are first randomly cropped to 112 × 112 pixels; the start address of the network's video frames is then determined among the output video frames, and a sixteen-frame network input is selected at that address through a sliding window, the selected video frames having size 3 × 16 × 112 × 112; meanwhile, data augmentation is realized by random flipping and mean-subtraction operations performed in turn along the three RGB channels of the image frames; finally, the images are annotated with the annotation software LabelImg, the abnormal behaviors being labeled 'calling', 'smoking', 'playing mobile phone' and 'dozing'.
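The cropping and windowing described above can be sketched with NumPy (shapes follow the text; the horizontal flip and the per-channel mean used for subtraction are illustrative assumptions):

```python
import numpy as np

def preprocess_clip(frames: np.ndarray, start: int, rng: np.random.Generator,
                    crop: int = 112, length: int = 16) -> np.ndarray:
    """frames: (N, H, W, 3) uint8 video frames.
    Returns a (3, length, crop, crop) float32 network input."""
    n, h, w, _ = frames.shape
    clip = frames[start:start + length]                   # 16-frame sliding window
    y = rng.integers(0, h - crop + 1)                     # random 112x112 crop offsets
    x = rng.integers(0, w - crop + 1)
    clip = clip[:, y:y + crop, x:x + crop, :].astype(np.float32)
    if rng.random() < 0.5:                                # random horizontal flip
        clip = clip[:, :, ::-1, :]
    clip -= clip.mean(axis=(0, 1, 2), keepdims=True)      # per-RGB-channel mean subtraction
    return clip.transpose(3, 0, 1, 2)                     # -> (3, 16, 112, 112)

rng = np.random.default_rng(0)
frames = np.zeros((32, 128, 171, 3), dtype=np.uint8)      # dummy clip of 32 frames
clip = preprocess_clip(frames, start=0, rng=rng)          # shape (3, 16, 112, 112)
```

The returned tensor matches the 3 × 16 × 112 × 112 input size stated in the text.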
The step of fusing the non-local module in the C3D network in step 3) to obtain the NL-C3D network model includes:
3.1) The original C3D network model takes 3D convolution and 3D pooling as its main body, consisting of eight 3D convolutional layers with channel numbers 64, 128, 256, 256, 512, 512, 512 and 512, five 3D pooling layers, two fully connected layers and a softmax classifier; when the convolutional part is fused with the non-local neural network, compared with the original C3D network model, the non-local network is fused with the convolutional layers as a whole through residual connections, so that each convolutional layer is fused with a non-local neural network module;
3.2) The shape of the input X of the C3D convolution module is (T, H, W, C), where T is the number of video frames, H the height of a video frame, W its width and C the number of channels; after the non-local neural network module is fused into the C3D network model, the input X is fed into the θ, φ and g convolution branches, each corresponding to a 1 × 1 convolution with stride 1, and the output of each convolution branch then undergoes a matrix dimension change;
3.3) The reshaped results of the θ and φ branches are matrix-multiplied to obtain a (C, C) matrix; normalization is then completed with Softmax, and the normalized result is further matrix-multiplied with the reshaped result of the g branch;
3.4) The result of step 3.3) undergoes a dimension change and is fed into the output convolution module, and the result is finally added, through a residual connection, to the input X, yielding a C3D network model fused with the non-local neural network module, namely the NL-C3D network model.
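Steps 3.1)–3.4) can be sketched in NumPy. Because a 1 × 1 (× 1) convolution with stride 1 is just a per-position linear map, each branch is written as a matrix product over flattened positions; the channel sizes and random weights are illustrative assumptions, and the softmax-normalized attention follows the non-local formulation of equations (1)–(4):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def non_local_block(x, w_theta, w_phi, w_g, w_z):
    """x: (T*H*W, C) flattened input; w_theta/w_phi/w_g: (C, Ci); w_z: (Ci, C).
    Computes y_i = softmax_j(theta(x_i) . phi(x_j)) g(x_j), then Z = Y W_z + X."""
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g   # 1x1x1 conv == per-position linear map
    attn = softmax(theta @ phi.T, axis=-1)            # pairwise similarities, softmax-normalized
    y = attn @ g                                      # weighted sum over all positions (eq. (1))
    return y @ w_z + x                                # output convolution W_z plus residual (eq. (4))

rng = np.random.default_rng(0)
thw, c, ci = 16 * 7 * 7, 512, 256                     # illustrative feature-map sizes
x = rng.standard_normal((thw, c)).astype(np.float32)
ws = [rng.standard_normal(s).astype(np.float32) * 0.01
      for s in ((c, ci), (c, ci), (c, ci), (ci, c))]
z = non_local_block(x, *ws)
```

As the text notes, the output Z has the same dimensions as the input X, so the module can be inserted after any convolutional layer without changing the network.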
The step of importing the image frames into the NL-C3D network model for training and checking in the step 4) comprises the following steps:
4.1) Video frames of size 3 × 16 × 112 × 112 are fed into the NL-C3D network model; the NL-C3D network hierarchy comprises, in turn, one 64-channel convolutional layer, one 128-channel convolutional layer, two 256-channel convolutional layers, two 512-channel convolutional layers and another two 512-channel convolutional layers, each group followed by a pooling layer, then two 2096-dimensional fully connected layers and a softmax layer; the final output has dimensions [10, n], where n is the number of categories in the data set used for training;
4.2) After the non-local neural network is fused into the convolutional layers, the NL-C3D network model enhances the local characteristics of the target by compressing channel features and aggregating global spatial features. First, the similarity values between the pixel at the current position and all pixels in the feature map are computed; a feature-weighted summation is then performed over the regions where similarity exists, so as to enrich the feature information of those regions and thereby improve the global characteristics. The non-local operation performs a weighted summation of the value at a given position with all the feature information of the feature map, as shown in equation (1):
y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) \qquad (1)
where x and y denote the input and output features respectively, corresponding to the feature maps of the images and video, and both have the same dimensions; i is the position index of the current feature point and j indexes the other feature points in the feature map; the function f(x_i, x_j) describes the degree of correlation between x_i and x_j, i.e. the smaller the value of f, the smaller the influence of j on i; g(x_j) is a linear combination function that provides the feature of the map at position j; C(x) is a normalization factor; and f(x_i, x_j) is a Gaussian function, as shown in equation (2):
f(x_i, x_j) = e^{x_i^{T} x_j} \qquad (2)
the normalization factor C (x) is expressed as shown in equation (3):
C(x) = \sum_{\forall j} f(x_i, x_j) \qquad (3)
As can be seen from equation (1), the non-local operation considers the relationship between the current position and all positions in the feature map, so it can effectively capture the multi-position dependencies of the video frames. The connection between the convolutional layers and the non-local neural network adopts a residual structure. In a concrete implementation, the non-local operation is converted into the form of matrix multiplications and convolution operations; after these operations and conversions, the output feature Z has the same dimensions as the input X, so the module can be added directly to each convolution module of the network without modifying the network, and the convolutional part fused with the non-local neural network can be defined as shown in equation (4):
Z_i = W_z Y_i + X_i \qquad (4)
where Y_i is obtained by the operation of equation (1), W_z is the weight matrix, and +X_i represents the residual connection; using this residual form, the spatio-temporal features of the video are obtained without disturbing the original parameters and the initialization method of the model;
4.3) For convolution, three-dimensional convolution is used, which performs the convolution operation over adjacent frames so that information is obtained from both the spatial and temporal dimensions of the video, preserving the spatial and temporal data information; compared with 2D convolution, a depth dimension is added to the input image and one dimension is added to the convolution kernel, and for multiple channels the input size is 3 × 16 × 112 × 112;
4.4) To prevent overfitting, a Dropout layer is introduced into each layer of the NL-C3D network model: nodes of the neural network are eliminated at random and the connections attached to them are deleted, reducing the complexity of the network; the Dropout rate of the model is ρ and the retention probability is 1 − ρ;
4.5) The loss function is the index that measures how well the network structure trains on the data set (the larger its value, the greater the error) and serves as the reference standard; the model uses the cross-entropy loss function, as shown in equation (5):
H(p, q) = -\sum_{x} p(x) \log q(x) \qquad (5)
It reflects, through the probability distribution q, the difficulty of expressing the probability distribution p, where p denotes the probability of the correct answer and q the predicted value; the smaller the cross entropy, the smaller the difference between the two probability distributions;
on this basis, the probability of each class is obtained using the Softmax function, shown in equation (6):
S_i = \frac{e^{V_i}}{\sum_{j=1}^{M} e^{V_j}} \qquad (6)
where S_i denotes the classification probability score of each of the M results; the scores S_1, S_2, ..., S_M of the categories are obtained over the M results, and during estimation the exponential score of a given category is divided by the sum of the exponential scores of all categories; the actual category is the one with the minimum loss on this basis and therefore the maximum probability, which gives the classification result.
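Equations (5) and (6) can be checked numerically with NumPy (the raw class scores below are illustrative values, not from the patent):

```python
import numpy as np

def softmax_probs(v):
    """Equation (6): S_i = e^{V_i} / sum_j e^{V_j}."""
    e = np.exp(v - np.max(v))          # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(p, q):
    """Equation (5): H(p, q) = -sum_x p(x) log q(x)."""
    return -np.sum(p * np.log(q))

scores = np.array([2.0, 1.0, 0.5, 0.1])    # raw scores for four behavior classes
q = softmax_probs(scores)                  # predicted distribution, sums to 1
p = np.array([1.0, 0.0, 0.0, 0.0])         # one-hot "correct answer" distribution
loss = cross_entropy(p, q)                 # equals -log q[0] for a one-hot target
pred = int(np.argmax(q))                   # class with the maximum probability
```

For a one-hot target the cross entropy reduces to the negative log-probability of the correct class, and it shrinks as that probability grows, matching the description above.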
Compared with the prior art, the technical scheme has the following advantages:
1. the method of the technical scheme can accurately detect human behavior in video;
2. the technical scheme fuses channel features and aggregates the spatial features of the network within the non-local neural network, so as to improve the local characteristics and the detection precision;
3. to prevent overfitting, the newly introduced Dropout layer randomly eliminates some nodes of the neural network and also removes all connections attached to those nodes, reducing the complexity of the network.
The method is based on an improvement of the C3D model: a non-local neural network module is fused into the 3D convolution part, which solves the long-range dependency problem of video frames, enhances the understanding of feature information and improves detection accuracy; and Dropout computation is added at each layer, which reduces the amount of computation, prevents overfitting and increases detection speed.
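The per-layer Dropout described above (drop rate ρ, retention probability 1 − ρ) can be sketched as inverted dropout in NumPy; the 1/(1 − ρ) rescaling of surviving nodes, which keeps the expected activation unchanged, is a standard implementation choice assumed here rather than stated in the patent:

```python
import numpy as np

def dropout(x: np.ndarray, rho: float, rng: np.random.Generator) -> np.ndarray:
    """Zero each node with probability rho; scale survivors by 1/(1 - rho)."""
    mask = rng.random(x.shape) >= rho       # keep with probability 1 - rho
    return x * mask / (1.0 - rho)

rng = np.random.default_rng(0)
x = np.ones((1000,))
y = dropout(x, rho=0.5, rng=rng)            # entries are 0 (dropped) or 2.0 (kept, rescaled)
```

At inference time the layer would simply be skipped, since the rescaling already preserves the mean activation during training.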
Drawings
FIG. 1 is a flow chart of an embodiment;
FIG. 2 is a schematic diagram of a non-local neural network according to an embodiment;
FIG. 3 is a schematic diagram of a NL-C3D network in an embodiment.
Detailed Description
The invention is described in further detail below with reference to the following figures and specific examples, but the invention is not limited thereto.
The embodiment is as follows:
Referring to FIG. 1, the tower crane operator abnormal behavior detection method based on the NL-C3D model includes the following steps: 1) Collecting a monitoring-video data set of the operation process of tower crane operators, and dividing the video data set into a training set, a verification set and a test set;
2) Dividing the video data into image frames by an algorithm, then cropping the image frames to a uniform size and keeping a corresponding number of image-frame samples; specifically, the imported video frames are adjusted to shape [10,16,112,112,3], where 16 is the frame_length, meaning that each training sample is 16 frames; 112 is the crop_size of the images, meaning that the video frames are cropped to 112 × 112 pixels; and 3 is the number of input channels;
3) Fusing a non-local module in a C3D network, as shown in FIG. 2, to obtain an NL-C3D network model, as shown in FIG. 3;
4) Importing the image-frame data set of step 2) into the NL-C3D network model in the order training set, verification set and test set for training and checking, and then obtaining the final result with a softmax classifier. The process of collecting the video data set in step 1) is as follows: an operation video was shot with a camera at a resolution of 320 × 240 pixels and a frame rate greater than 25 frames per second; one frame was then extracted every four frames, giving 1620 pictures after extraction; and the data set was divided into a training set, a verification set and a test set in the ratio 6:2:2.
The image frames are cropped in step 2) and a corresponding number of image-frame samples are kept, specifically as follows: during input processing, in order to enhance the safety and precision of the model, the image frames are first randomly cropped to 112 × 112 pixels; the start address of the network's video frames is then determined among the output video frames, and a sixteen-frame network input is selected at that address through a sliding window, the selected video frames having size 3 × 16 × 112 × 112; meanwhile, data augmentation is realized by random flipping and mean-subtraction operations performed in turn along the three RGB channels of the image frames; finally, the images are annotated with the annotation software LabelImg, the abnormal behaviors being labeled 'calling', 'smoking', 'playing mobile phone' and 'dozing'.
The step 3) of fusing the non-local module in the C3D network to obtain the NL-C3D network model includes:
3.1) The original C3D network model takes 3D convolution and 3D pooling as its main body, consisting of eight 3D convolutional layers with channel numbers 64, 128, 256, 256, 512, 512, 512 and 512, five 3D pooling layers, two fully connected layers and a softmax classifier; when the convolutional part is fused with the non-local neural network, compared with the original C3D network model, the non-local network is fused with the convolutional layers as a whole through residual connections, so that each convolutional layer is fused with a non-local neural network module;
3.2) The shape of the input X of the C3D convolution module is (T, H, W, C), where T is the number of video frames, H the height of a video frame, W its width and C the number of channels; after the non-local neural network module is fused into the C3D network model, the input X is fed into the θ, φ and g convolution branches, each corresponding to a 1 × 1 convolution with stride 1, and the output of each convolution branch then undergoes a matrix dimension change;
3.3) The reshaped results of the θ and φ branches are matrix-multiplied to obtain a (C, C) matrix; normalization is then completed with Softmax, and the normalized result is further matrix-multiplied with the reshaped result of the g branch;
3.4) The result of step 3.3) undergoes a dimension change and is fed into the output convolution module, and the result is finally added, through a residual connection, to the input X, yielding a C3D network model fused with the non-local neural network module, namely the NL-C3D network model.
The step of importing the image frames into the NL-C3D network model for training and checking in the step 4) comprises the following steps:
4.1) Video frames of size 3 × 16 × 112 × 112 are fed into the NL-C3D network model; the NL-C3D network hierarchy comprises, in turn, one 64-channel convolutional layer, one 128-channel convolutional layer, two 256-channel convolutional layers, two 512-channel convolutional layers and another two 512-channel convolutional layers, each group followed by a pooling layer, then two 2096-dimensional fully connected layers and a softmax layer; the final output has dimensions [10, n], where n is the number of categories in the data set used for training;
4.2) After the non-local neural network is fused into the convolutional layers, the NL-C3D network model enhances the local characteristics of the target by compressing channel features and aggregating global spatial features. First, the similarity values between the pixel at the current position and all pixels in the feature map are computed; a feature-weighted summation is then performed over the regions where similarity exists, so as to enrich the feature information of those regions and thereby improve the global characteristics. The non-local operation performs a weighted summation of the value at a given position with all the feature information of the feature map, as shown in equation (1):
y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) \qquad (1)
where x and y denote the input and output features respectively, corresponding to the feature maps of the images and video, and both have the same dimensions; i is the position index of the current feature point and j indexes the other feature points in the feature map; the function f(x_i, x_j) describes the degree of correlation between x_i and x_j, i.e. the smaller the value of f, the smaller the influence of j on i; g(x_j) is a linear combination function that provides the feature of the map at position j; C(x) is a normalization factor; and f(x_i, x_j) is a Gaussian function, as shown in equation (2):
f(x_i, x_j) = e^{x_i^{T} x_j} \qquad (2)
the normalization factor C (x) is expressed as shown in equation (3):
C(x) = \sum_{\forall j} f(x_i, x_j) \qquad (3)
As can be seen from equation (1), the non-local operation considers the relationship between the current position and all positions in the feature map, so it can effectively capture the multi-position dependencies of the video frames. The connection between the convolutional layers and the non-local neural network adopts a residual structure. In a concrete implementation, the non-local operation is converted into the form of matrix multiplications and convolution operations; after these operations and conversions, the output feature Z has the same dimensions as the input X, so the module can be added directly to each convolution module of the network without modifying the network, and the convolutional part fused with the non-local neural network can be defined as shown in equation (4):
Z_i = W_z Y_i + X_i \qquad (4)
where Y_i is obtained by the operation of equation (1), W_z is the weight matrix, and +X_i represents the residual connection; using this residual form, the spatio-temporal features of the video are obtained without disturbing the original parameters and the initialization method of the model;
4.3) For convolution, three-dimensional convolution is used, which performs the convolution operation over adjacent frames so that information is obtained from both the spatial and temporal dimensions of the video, preserving the spatial and temporal data information; compared with 2D convolution, a depth dimension is added to the input image and one dimension is added to the convolution kernel, and for multiple channels the input size is 3 × 16 × 112 × 112;
4.4) To prevent overfitting, a Dropout layer is introduced into each layer of the NL-C3D network model: nodes of the neural network are eliminated at random and the connections attached to them are deleted, reducing the complexity of the network; the Dropout rate of the model is ρ and the retention probability is 1 − ρ;
4.5) The loss function is the index that measures how well the network structure trains on the data set (the larger its value, the greater the error) and serves as the reference standard; the model uses the cross-entropy loss function, as shown in equation (5):
H(p, q) = -\sum_{x} p(x) \log q(x) \qquad (5)
It reflects, through the probability distribution q, the difficulty of expressing the probability distribution p, where p denotes the probability of the correct answer and q the predicted value; the smaller the cross entropy, the smaller the difference between the two probability distributions;
on this basis, the probability of each class is obtained using the Softmax function, shown in equation (6):
S_i = \frac{e^{V_i}}{\sum_{j=1}^{M} e^{V_j}} \qquad (6)
where S_i denotes the classification probability score of each of the M results; the scores S_1, S_2, ..., S_M of the categories are obtained over the M results, and during estimation the exponential score of a given category is divided by the sum of the exponential scores of all categories; the actual category is the one with the minimum loss on this basis and therefore the maximum probability, which gives the classification result.
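The classification step of the embodiment, applying the Softmax of equation (6) over the scores of the four labeled abnormal behaviors and taking the most probable category, can be sketched as follows (class names from the annotation step; the score values are illustrative):

```python
import numpy as np

CLASSES = ["calling", "smoking", "playing mobile phone", "dozing"]

def classify(scores: np.ndarray) -> str:
    """Apply equation (6) to the raw scores and return the class
    with the maximum probability."""
    e = np.exp(scores - scores.max())   # subtract max for numerical stability
    probs = e / e.sum()
    return CLASSES[int(np.argmax(probs))]

label = classify(np.array([0.3, 2.1, 0.4, 0.2]))   # -> "smoking"
```

Since Softmax is monotonic, the most probable class is simply the one with the largest raw score.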
Performance evaluation:
The NL-C3D network model was compared with the C3D network model on the same data set and in the same experimental environment, with accuracy and elapsed time as the evaluation indexes; the results are shown in Table 1:
TABLE 1 Comparison of performance before and after improvement of the model

Network model    Accuracy    Elapsed time/s
C3D              0.72        268
NL-C3D           0.75        237
From the above table it can be seen that the NL-C3D model improves both the accuracy and the elapsed time of detection: the non-local neural network module fused into the 3D convolution part solves the long-range dependency problem of video frames, enhances the understanding of feature information and improves the detection accuracy; and the Dropout computation added at each layer reduces the amount of computation, prevents overfitting and increases the recognition speed.

Claims (5)

1. A tower crane operator abnormal behavior detection method based on an NL-C3D model is characterized by comprising the following steps:
1) Collecting a monitoring video data set related to the operation process of tower crane operators, and dividing the video data set into a training set, a verification set and a test set;
2) Dividing video data into image frames through an algorithm, then cutting the image size of the image frames to ensure the size of the images to be consistent, and keeping a corresponding number of image frame samples;
3) Fusing a non-local module in a C3D network to obtain an NL-C3D network model;
4) Importing the image-frame data set of step 2) into the NL-C3D network model in the order training set, verification set and test set for training and checking, and then obtaining the final result with a softmax classifier.
2. The NL-C3D model-based tower crane operator abnormal behavior detection method according to claim 1, wherein the video data set collection process in step 1) is as follows: an operation video is taken by a camera, with a resolution greater than 320 × 240 pixels and a frame rate greater than 25 frames per second; one frame is then extracted from the data set every four frames to obtain the frame images; for clips whose frame interval cannot give the network an input temporal width of sixteen frames, the sampling step is manually reduced until the requirement of at least sixteen frames is met; and the data set is divided into a training set, a verification set and a test set in the ratio 6:2:2.
3. The NL-C3D model-based tower crane operator abnormal behavior detection method according to claim 1, wherein the image frames are cropped in step 2) and a corresponding number of image-frame samples are kept, specifically as follows: during input processing, in order to enhance the safety and precision of the model, the image frames are first randomly cropped to 112 × 112 pixels; the start address of the network's video frames is then determined among the output video frames, and sixteen network-input video frames are selected at that address through a sliding window, the selected video frames having size 3 × 16 × 112 × 112; meanwhile, data augmentation is realized by random flipping and mean-subtraction operations performed in turn along the three RGB channels of the image frames; and finally the images are annotated with the annotation software LabelImg, the abnormal behaviors being labeled "calling", "smoking", "playing mobile phone" and "dozing".
4. The NL-C3D model-based tower crane operator abnormal behavior detection method according to claim 1, wherein the step 3) of fusing a non-local module in a C3D network to obtain the NL-C3D network model comprises the following steps:
3.1) The original C3D network model takes 3D convolution and 3D pooling as its main body, consisting of eight 3D convolutional layers with channel numbers 64, 128, 256, 256, 512, 512, 512 and 512, five 3D pooling layers, two fully connected layers and a softmax classifier; when the convolutional part is fused with the non-local neural network, compared with the original C3D network model, the non-local network is fused with the convolutional layers as a whole through residual connections, so that each convolutional layer is fused with a non-local neural network module;
3.2 ) The input X of the C3D convolution module has shape T, H, W and C, where T is the number of video frames, H and W are the height and width of each frame, and C is the number of channels; after the non-local neural network module is fused into the C3D network model, the input X is fed into the θ, φ and g convolution modules respectively; θ, φ and g each correspond to a 1 × 1 convolution with a stride of 1, and the output result of each convolution module is then subjected to matrix dimension change;
3.3 ) The dimension-changed output of the θ branch is matrix-multiplied with that of the φ branch to obtain a (C, C) matrix; normalization is then completed by Softmax, and the normalized result is further matrix-multiplied with the dimension-changed output of the g branch;
3.4 ) The result obtained in step 3.3) is subjected to dimension change and fed into the W_z convolution module, and the output is finally added to the input X through a residual connection, obtaining the C3D network model fused with the non-local neural network module, i.e. the NL-C3D network model.
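Steps 3.1)-3.4) above can be sketched with NumPy as below, treating the 1 × 1 convolutions of the θ, φ and g branches as per-position linear maps over a feature map of shape (C, T, H, W). The (THW, THW) similarity matrix used here follows the standard non-local formulation and is an assumption where the claim's intermediate matrix shapes are ambiguous; all names are illustrative.

```python
import numpy as np

def non_local_block(x, w_theta, w_phi, w_g, w_z):
    """Illustrative sketch of a non-local block (steps 3.1-3.4).
    x: feature map of shape (C, T, H, W).  The theta/phi/g 1x1
    convolutions reduce to per-position linear maps (matmuls).
    The (THW, THW) similarity matrix is an assumption."""
    c, t, h, w = x.shape
    flat = x.reshape(c, -1).T                        # (THW, C)
    theta = flat @ w_theta                           # (THW, C')
    phi = flat @ w_phi                               # (THW, C')
    g = flat @ w_g                                   # (THW, C')
    sim = theta @ phi.T                              # pairwise similarity
    sim = np.exp(sim - sim.max(axis=1, keepdims=True))
    attn = sim / sim.sum(axis=1, keepdims=True)      # Softmax normalization
    y = attn @ g                                     # weighted sum of g features
    z = (y @ w_z).T.reshape(c, t, h, w)              # W_z maps back to C channels
    return z + x                                     # residual connection with X
```

Because the output has the same shape as the input, the block can be dropped into any convolution stage without changing the surrounding network, which is exactly the property the residual formulation of step 3.4) relies on.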
5. The NL-C3D model-based tower crane operator abnormal behavior detection method according to claim 1, wherein the step of importing the image frames into the NL-C3D network model for training and checking in the step 4) comprises the steps of:
4.1 ) Video frames of size 3 × 16 × 112 × 112 are transmitted into the NL-C3D network model; the NL-C3D network hierarchy consists of one 64-channel convolutional layer, one 128-channel convolutional layer, two 256-channel convolutional layers, two 512-channel convolutional layers and another two 512-channel convolutional layers, each group followed by a pooling layer, and then two 2096-dimensional fully connected layers and a softmax layer; the final output is dimension information of [10, n], where n is the number of categories in the training data set;
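A shape-bookkeeping sketch for the backbone of step 4.1). The pooling schedule (a 1 × 2 × 2 first pool, 2 × 2 × 2 pools thereafter) is taken from the original C3D design and is an assumption, as is the absence of padding in the last pool (the original C3D pads it to reach a 4 × 4 spatial output).

```python
def c3d_shape(t=16, h=112, w=112):
    """Track the (T, H, W) shape of a clip through the five pooling
    layers of a C3D-style backbone.  Assumptions: convolutions are
    shape-preserving (padded), the first pool is 1x2x2 (spatial
    only) and the remaining four are 2x2x2, with no extra padding."""
    pools = [(1, 2, 2)] + [(2, 2, 2)] * 4
    for pt, ph, pw in pools:
        t, h, w = t // pt, h // ph, w // pw
    return t, h, w
```

Under these assumptions a 16 × 112 × 112 clip is reduced to a 1 × 3 × 3 feature map per channel before the fully connected layers.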
4.2 ) After the non-local neural network is fused into the convolutional layers, the NL-C3D network model enhances the local characteristics of the target by compressing channel features and aggregating global spatial features: first, the similarity values between the pixel at the current position and all pixels in the feature map are computed; feature-weighted summation is then performed over the regions where these similarity values exist, increasing the feature information of those regions and thereby achieving the effect of global characteristic improvement. The non-local operation performs a weighted sum of the value at a given position with all the feature information of the feature map, as shown in formula (1):
y_i = (1/C(x)) ∑_j f(x_i, x_j) g(x_j)    (1),
where x and y represent the input features and output features respectively, corresponding to the feature images of the graphics and video; i represents the position code of the current feature point and j represents the codes of the other feature points in the feature image; the function f(x_i, x_j) describes the degree of correlation between x_i and x_j, i.e. the smaller the value of f, the smaller the degree of interference of j on i; g(x_j) is a linear combination function that provides the feature of the graph at position j; and C(x) is a normalization parameter. f(x_i, x_j) is a Gaussian function, as shown in equation (2):
f(x_i, x_j) = e^(x_i^T x_j)    (2),
the normalization factor C (x) is expressed as shown in equation (3):
C(x) = ∑_j f(x_i, x_j)    (3),
As formula (1) shows, because the non-local operation considers the relationship between the current position and all positions in the feature map, it can effectively capture the multi-position dependence of the video frames; the connection between a convolution layer and the non-local neural network adopts a residual connection structure. In the concrete implementation, the non-local operation is converted into the form of matrix multiplication and convolution operations; after these operations and conversions, the output feature Z has the same dimensions as the input X, so the module can be added directly to each convolution module of the network without modifying the network, and the convolution part fused with the non-local neural network can be defined as shown in formula (4):
Z_i = W_z Y_i + X_i    (4),
where Y_i is obtained by the operation of formula (1), W_z is the weight matrix, and "+ X_i" represents the residual connection; the spatio-temporal characteristics of the video are obtained through this residual connection without interfering with the original parameters or the initialization method of the model;
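Formulas (1)-(3) can be verified numerically: computing y_i = (1/C(x)) ∑_j f(x_i, x_j) g(x_j) literally with the Gaussian f of formula (2) gives the same result as the matrix-multiplication-plus-softmax form into which, per the claim, the non-local operation is converted. The sketch below uses a toy feature matrix and illustrative names.

```python
import numpy as np

def non_local_direct(x, g_w):
    """Compute y_i = (1/C(x)) * sum_j f(x_i, x_j) g(x_j) literally,
    with the Gaussian f(x_i, x_j) = exp(x_i^T x_j) of equation (2)
    and the normalizer C(x) = sum_j f(x_i, x_j) of equation (3).
    x: (N, D) feature rows; g_w: linear map for g (illustrative)."""
    n = x.shape[0]
    y = np.zeros((n, g_w.shape[1]))
    for i in range(n):
        f = np.exp(x @ x[i])              # f(x_i, x_j) for every j
        y[i] = (f / f.sum()) @ (x @ g_w)  # normalized weighted sum
    return y

def non_local_matrix(x, g_w):
    """The same operation written as matrix multiplication plus a
    row-wise softmax -- the form used inside the convolution blocks."""
    s = x @ x.T
    attn = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)
    return attn @ (x @ g_w)
```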
4.3 ) For convolution, three-dimensional convolution is used, which performs the convolution operation over adjacent frames so as to obtain information from both the spatial and temporal dimensions of the video, preserving both the spatial and the temporal data information; compared with 2D convolution, a depth dimension is added to the input image and one dimension is added to the convolution kernel, and the multi-channel input size is 3 × 16 × 112 × 112;
4.4 ) To prevent overfitting, a Dropout layer is introduced into each layer of the NL-C3D network model: nodes in the neural network are dropped at random together with their connections, reducing the complexity of the network; the Dropout rate of the model is ρ, and the keep probability is 1 − ρ;
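Step 4.4) corresponds to standard (inverted) dropout, sketched below with NumPy; the rescaling by 1/(1 − ρ), which keeps the expected activation unchanged, is the usual implementation detail and an assumption not stated in the claim.

```python
import numpy as np

def dropout(x, rho, rng, training=True):
    """Sketch of step 4.4): drop each activation with probability
    rho (keep probability 1 - rho) and rescale the survivors by
    1/(1 - rho) so the expected output matches the input.  The
    rescaling ("inverted" dropout) is an assumed convention."""
    if not training or rho == 0.0:
        return x                           # dropout is disabled at test time
    mask = rng.random(x.shape) >= rho      # True for kept nodes
    return x * mask / (1.0 - rho)
```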
4.5 ) The loss function is an index for measuring how well the network structure trains on the data set; the larger its value, the greater the error, so the loss function serves as the reference standard. A cross-entropy loss function is used in the model, as shown in formula (5):
H(p, q) = −∑ p(x) log q(x)    (5),
which reflects how hard it is to express the probability distribution p through the probability distribution q; p represents the probability of the correct answer, q represents the predicted value, and the smaller the cross entropy, the smaller the difference between the two probability distributions;
on this basis, the probability of each class is obtained with a Softmax function, as shown in equation (6):
Softmax(S_i) = e^(S_i) / ∑_{j=1}^{M} e^(S_j)    (6),
where S_1, S_2, ..., S_M are the classification scores of the M classes; during estimation, a given class score is exponentiated and divided by the sum of the exponentiated scores of all classes, the class with the minimum loss is taken as the actual class, its probability being the maximum, and the classification result is thus obtained.
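Equations (5) and (6) can be sketched directly; the four-class score vector below (matching the four abnormal-behavior labels) is illustrative only.

```python
import numpy as np

def softmax(scores):
    """Equation (6): exponentiate each class score and divide by
    the sum of the exponentiated scores of all M classes.  The
    max-shift is a standard numerical-stability detail (assumed)."""
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def cross_entropy(p, q, eps=1e-12):
    """Equation (5): H(p, q) = -sum p(x) log q(x); eps guards
    against log(0) and is an implementation assumption."""
    return -np.sum(p * np.log(q + eps))
```

A prediction whose largest probability falls on the correct class yields a smaller cross entropy than one that favors a wrong class, which is exactly the "minimum loss → maximum probability" criterion the claim describes.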
CN202211462437.8A 2022-11-22 2022-11-22 Tower crane operator abnormal behavior detection method based on NL-C3D model Pending CN115761888A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211462437.8A CN115761888A (en) 2022-11-22 2022-11-22 Tower crane operator abnormal behavior detection method based on NL-C3D model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211462437.8A CN115761888A (en) 2022-11-22 2022-11-22 Tower crane operator abnormal behavior detection method based on NL-C3D model

Publications (1)

Publication Number Publication Date
CN115761888A true CN115761888A (en) 2023-03-07

Family

ID=85334576

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211462437.8A Pending CN115761888A (en) 2022-11-22 2022-11-22 Tower crane operator abnormal behavior detection method based on NL-C3D model

Country Status (1)

Country Link
CN (1) CN115761888A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117253293A (en) * 2023-11-15 2023-12-19 Jiangxi Normal University Behavior recognition method, system, storage medium and computer equipment
CN117975376A (en) * 2024-04-02 2024-05-03 Hunan University Mine operation safety detection method based on depth grading fusion residual error network
CN117975376B (en) * 2024-04-02 2024-06-07 Hunan University Mine operation safety detection method based on depth grading fusion residual error network

Similar Documents

Publication Publication Date Title
CN110378381B (en) Object detection method, device and computer storage medium
CN108447080B (en) Target tracking method, system and storage medium based on hierarchical data association and convolutional neural network
CN111814902A (en) Target detection model training method, target identification method, device and medium
CN111709313B (en) Pedestrian re-identification method based on local and channel combination characteristics
CN112991278B (en) Method and system for detecting Deepfake video by combining RGB (red, green and blue) space domain characteristics and LoG (LoG) time domain characteristics
CN108961308B (en) Residual error depth characteristic target tracking method for drift detection
CN117079139B (en) Remote sensing image target detection method and system based on multi-scale semantic features
CN113192124B (en) Image target positioning method based on twin network
CN114627447A (en) Road vehicle tracking method and system based on attention mechanism and multi-target tracking
CN112966574A (en) Human body three-dimensional key point prediction method and device and electronic equipment
CN111242026B (en) Remote sensing image target detection method based on spatial hierarchy perception module and metric learning
CN115761888A (en) Tower crane operator abnormal behavior detection method based on NL-C3D model
CN109165698A (en) A kind of image classification recognition methods and its storage medium towards wisdom traffic
CN111738114A (en) Vehicle target detection method based on anchor-free accurate sampling remote sensing image
CN114782997A (en) Pedestrian re-identification method and system based on multi-loss attention adaptive network
CN115311502A (en) Remote sensing image small sample scene classification method based on multi-scale double-flow architecture
CN111507467A (en) Neural network model training method and device, computer equipment and storage medium
CN114202671B (en) Image prediction optimization processing method and device
CN111242028A (en) Remote sensing image ground object segmentation method based on U-Net
CN115330833A (en) Fruit yield estimation method with improved multi-target tracking
CN114155278A (en) Target tracking and related model training method, related device, equipment and medium
CN114492755A (en) Target detection model compression method based on knowledge distillation
CN112085164B (en) Regional recommendation network extraction method based on anchor-free frame network
CN114743257A (en) Method for detecting and identifying image target behaviors
CN111582057B (en) Face verification method based on local receptive field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination