CN112949572A - Slim-YOLOv3-based mask wearing condition detection method - Google Patents
Slim-YOLOv3-based mask wearing condition detection method
- Publication number
- CN112949572A (Application CN202110330611.2A)
- Authority
- CN
- China
- Prior art keywords
- mask
- convolution
- network
- slim
- prediction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
- G06V40/165—Detection; Localisation; Normalisation using facial parts and geometric relationships
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Oral & Maxillofacial Surgery (AREA)
- General Physics & Mathematics (AREA)
- Biophysics (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Geometry (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
Abstract
The invention belongs to the technical field of deep learning target detection and computer vision, and particularly relates to a mask wearing condition detection method based on Slim-YOLOv3, which comprises the following steps: acquiring face video data in real time and preprocessing it; and inputting the preprocessed face image into a trained Slim-YOLOv3 model to judge whether the user wears the mask correctly. The method adopts a Slim-YOLOv3-based video detection approach for mask wearing and uses an improved unsupervised self-classification method to divide the data of irregularly worn masks into subclasses, so that the mask wearing video detection task can be carried out more accurately and rapidly. The proposed network is also more compact, which further reduces application cost.
Description
Technical Field
The invention belongs to the technical field of deep learning target detection and computer vision, and particularly relates to a mask wearing condition detection method based on Slim-YOLOv3.
Background
Harmful gases, odors, aerosols, viruses and the like can enter the human body through the air, and wearing a mask correctly effectively prevents such substances from entering the body. Correct mask wearing not only stops viruses from spreading from asymptomatic carriers to other people and reduces the probability of secondary transmission, thereby protecting others, but also protects the wearer by lowering the amount of virus the wearer is exposed to, so that the risk of infection is lower.
In recent years, deep learning has advanced greatly in object detection, image classification, semantic segmentation and related fields, and algorithms built on convolutional neural networks have improved markedly in both accuracy and speed. Video detection of mask wearing is a target detection problem, and target detection is a multi-task deep learning problem that combines target classification with target localization.
At present, two key issues must be solved for video detection to meet the requirements of practical detection tasks:
(1) Real-time performance: only real-time on-site video detection can effectively capture the mask wearing condition of the current subject;
(2) High precision: only when the mask wearing condition of the current subject is obtained accurately can the detection provide effective assistance.
Although many mask-wearing video detection devices are already in practical use, devices with high detection precision often consume expensive computing resources, while inexpensive detectors cannot achieve high precision.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a mask wearing condition detection method based on Slim-YOLOv3, which comprises the following steps: acquiring face video data in real time, and preprocessing the face video data; inputting the preprocessed face image into a trained improved Slim-YOLOv3 model, and judging whether the user wears the mask correctly; the improved Slim-YOLOv3 model comprises a backbone network Darknet-53, a feature enhancement and prediction network and a decoding network;
the process of training the improved Slim-Yolov3 model includes:
s1: acquiring an original data set, and preprocessing the original data set to obtain a training sample set and a test sample set;
s2: classifying and re-labeling the data in the training sample set and the test sample set;
s3: inputting the classified training sample set into a backbone network Darknet-53 for multi-scale transformation, and extracting a plurality of scale features;
s4: inputting a plurality of scale features into a feature enhancement and prediction network to obtain a classification prediction result;
s5: inputting the classification prediction result into a decoding network for decoding;
s6: calculating a loss function of the model according to the decoding result;
S7: inputting the data in the test set into the model for prediction, optimizing the loss function of the model according to the prediction results, and finishing the training of the model when the loss function changes only slightly or the maximum number of iterations is reached. A minimal sketch of this training procedure is given below.
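As an illustration only, the following sketch shows one way steps S1-S7 could be organised as a training loop. PyTorch is assumed, and `yolo_loss` and the dataset objects are hypothetical placeholders, not names taken from this disclosure.

```python
# Minimal training-loop sketch for steps S1-S7 (an assumption-laden illustration,
# not the patented implementation). PyTorch is assumed; `yolo_loss` and the
# dataset objects are hypothetical placeholders.
import torch
from torch.utils.data import DataLoader

def train_slim_yolov3(model, yolo_loss, train_set, test_set,
                      epochs=100, lr=1e-3, tol=1e-3):
    train_loader = DataLoader(train_set, batch_size=8, shuffle=True)
    test_loader = DataLoader(test_set, batch_size=8)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    prev_loss = float("inf")
    for epoch in range(epochs):                       # stop at the iteration limit (S7)
        model.train()
        for images, targets in train_loader:          # S3-S5: backbone, head and decoder forward pass
            loss = yolo_loss(model(images), targets)  # S6: localisation + classification loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():                         # S7: evaluate on the test split
            test_loss = sum(yolo_loss(model(x), y).item()
                            for x, y in test_loader) / max(len(test_loader), 1)
        if abs(prev_loss - test_loss) < tol:          # or stop when the loss barely changes
            break
        prev_loss = test_loss
    return model
```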
Preferably, preprocessing the raw data set comprises: compressing and flipping the images in the original data set and changing their brightness to obtain enhanced image data; and splitting the enhanced image data to obtain a training sample set and a test sample set.
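A small sketch of this preprocessing under the stated operations (resizing/compression, flipping, brightness change, train/test split) follows. torchvision is assumed, and the 80/20 split ratio is an illustrative choice not given in the text.

```python
# Preprocessing sketch: compression/resizing, horizontal flip, brightness change,
# and a train/test split. torchvision is assumed; the 80/20 ratio is illustrative.
from torchvision import transforms
from torch.utils.data import random_split

augment = transforms.Compose([
    transforms.Resize((416, 416)),            # compress to the 416x416 network input size
    transforms.RandomHorizontalFlip(p=0.5),   # flip augmentation
    transforms.ColorJitter(brightness=0.3),   # change image brightness
    transforms.ToTensor(),
])

def split_dataset(dataset, train_ratio=0.8):
    """Segment the enhanced image data into training and test sample sets."""
    n_train = int(len(dataset) * train_ratio)
    return random_split(dataset, [n_train, len(dataset) - n_train])
```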
Preferably, the process of classifying the data in the training sample set and the test sample set includes: dividing the face images of the original data set into three categories according to mask wearing condition, namely images with the mask worn correctly, images with the mask worn irregularly, and images without a mask; and reclassifying the images of irregularly worn masks into several subclasses using an improved unsupervised image self-classification method, SCAN.
Further, the process of reclassifying the irregularly worn mask images with the improved unsupervised image self-classification method SCAN comprises the following steps:
Step 1: extracting the face regions of irregularly worn masks from the mask data set as a training set;
Step 2: performing classification training on the face-region data of the mask wearing data set with an ECAResnet50 network to obtain pre-training weights;
Step 3: importing the pre-training weights into the ECAResnet50-based network and extracting high-level semantic features of the images;
Step 4: calculating the cosine similarity between the high-level semantic features and grouping images whose semantic features have high similarity as neighbours;
Step 5: performing cluster learning with the nearest neighbours as a prior;
Step 6: fine-tuning the clustered images through self-labeling to obtain pseudo labels for four subclasses.
Further, the cosine similarity between two high-level semantic feature vectors is computed as

$$\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$

where x_i and y_i are the i-th components of the two feature vectors and n is their dimension. (A code sketch of this neighbour-mining step is given below.)
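The following sketch illustrates neighbour mining by cosine similarity as used in Steps 3-5 above; it is an illustration only. The feature matrix is assumed to come from the ECAResnet50 encoder, and k (the number of neighbours per sample) is a free hyper-parameter not specified in the text.

```python
# Neighbour mining by cosine similarity for the SCAN-style sub-classification.
import torch
import torch.nn.functional as F

def mine_neighbors(features: torch.Tensor, k: int = 20) -> torch.Tensor:
    """features: (N, D) high-level semantic features; returns (N, k) neighbour indices."""
    normalized = F.normalize(features, dim=1)     # unit-length vectors
    similarity = normalized @ normalized.t()      # pairwise cosine similarity
    similarity.fill_diagonal_(-1.0)               # exclude the sample itself
    return similarity.topk(k, dim=1).indices      # images with highest similarity become neighbours
```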
preferably, the process of extracting the multi-scale features of the classified images in the training sample set by using the backbone network ECADarknet-53 includes: inputting the image into a data enhancement module, and adjusting the image to 416 × 3; inputting the adjusted image into an ECADarknet53 network, and performing primary convolution dimensionality increase on the image by adopting a convolution block to obtain an image with the size of batch _ size 416 32; extracting the characteristics of the graph after dimension increase of the convolution by adopting five residual volume blocks of an attention mechanism ECANet module, wherein the extracted characteristic scale is increased after each residual volume block passes through, and finally outputting two characteristic layers obtained by a fourth residual volume block and a fifth residual volume block; where batch _ size represents the number of images input to the network at a time.
Further, the process of processing features with the attention mechanism ECANet module comprises: performing channel-wise global average pooling on the feature layer without reducing the dimension; selecting the data of k neighbouring channels for each channel, applying a 1 × 1 convolution to the globally average-pooled data, and passing the result through a sigmoid activation function; and expanding the activated data to the size of the input features and multiplying it with the input features to obtain enhanced features containing information from multiple channels. An ECANet module is added to each of the five residual convolution blocks of ECADarknet-53; each residual unit is obtained by adding the block input to the output of its two convolutions followed by one ECA module.
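A minimal sketch of such an ECA attention block is shown below, assuming PyTorch: channel-wise global average pooling without dimensionality reduction, a one-dimensional convolution spanning k neighbouring channels, a sigmoid gate, and re-scaling of the input feature map. The kernel size k = 3 is an illustrative default, not a value from the text.

```python
import torch
import torch.nn as nn

class ECAModule(nn.Module):
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                             # global average pooling -> (B, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)           # 1-D conv over k neighbouring channels
        weights = self.sigmoid(y).view(x.size(0), -1, 1, 1)
        return x * weights                                 # expand to the input size and multiply
```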
Preferably, the processing of the multiple scale features by the feature enhancement and prediction network includes:
Step 1: applying five convolutions to the features produced by the fifth residual convolution block of the ECADarknet53 network;
Step 2: applying one 3 × 3 convolution and one 1 × 1 convolution to the convolved features, and taking the result as the prediction of the scale feature layer corresponding to the fifth residual convolution block;
Step 3: upsampling (UpSampling2D) the features after the five convolutions, stacking them with the feature layer produced by the fourth residual convolution block, and fusing and enhancing the information of the two scales;
Step 4: applying five convolutions to the fused feature map, then one 3 × 3 convolution and one 1 × 1 convolution, to obtain the prediction of the scale feature layer corresponding to the fourth residual convolution block;
Step 5: outputting the prediction results of the feature layers at the two scales, where each scale's result contains, for every grid point, the prediction boxes generated from the prior boxes assigned to that scale together with their positions, confidences and classes, the two feature layers dividing the picture into grids of different sizes. (A structural sketch of this two-scale head is given below.)
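The following is a structural sketch of such a two-scale feature-enhancement and prediction head, assuming PyTorch. Channel counts follow the (26, 26, 512) and (13, 13, 1024) shapes mentioned in the text; the class and anchor counts and the helper names are illustrative assumptions.

```python
import torch
import torch.nn as nn

def conv_bn_leaky(c_in, c_out, k):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1),
    )

def five_convs(c_in, c_mid):
    layers = [conv_bn_leaky(c_in if i == 0 else c_mid, c_mid, 1 if i % 2 == 0 else 3)
              for i in range(5)]
    return nn.Sequential(*layers)

class TwoScaleHead(nn.Module):
    def __init__(self, num_anchors=3, num_classes=6):
        super().__init__()
        out_ch = num_anchors * (4 + 1 + num_classes)
        self.convs_13 = five_convs(1024, 512)
        self.pred_13 = nn.Sequential(conv_bn_leaky(512, 1024, 3), nn.Conv2d(1024, out_ch, 1))
        self.upsample = nn.Sequential(conv_bn_leaky(512, 256, 1), nn.Upsample(scale_factor=2))
        self.convs_26 = five_convs(512 + 256, 256)
        self.pred_26 = nn.Sequential(conv_bn_leaky(256, 512, 3), nn.Conv2d(512, out_ch, 1))

    def forward(self, feat_26, feat_13):           # (B,512,26,26) and (B,1024,13,13)
        x13 = self.convs_13(feat_13)
        out_13 = self.pred_13(x13)                 # prediction at the 13x13 scale
        x26 = torch.cat([self.upsample(x13), feat_26], dim=1)
        out_26 = self.pred_26(self.convs_26(x26))  # prediction at the fused 26x26 scale
        return out_13, out_26
```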
Preferably, the process of inputting the classification prediction result into the decoding network for decoding includes:
Step 1: adding the corresponding x_offset and y_offset to each grid point to obtain the centre of the prediction box;
Step 2: combining the prior box with h and w to compute the length and width of the prediction box;
Step 3: computing the localization loss from the position information and the ground-truth annotations, and the classification loss from the predicted and annotated categories;
Step 5: determining where the real box lies in the picture and which grid point is responsible for detecting it;
Step 6: computing the overlap between the real box and each prior box and selecting the prior box with the highest overlap for verification;
Step 7: obtaining the prediction the network should produce and comparing it with the actual prediction. (A decoding sketch for one scale is given below.)
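A decoding sketch for one scale follows, corresponding to Steps 1-2 above and following the standard YOLOv3 decoding: grid offsets are added to the predicted centre and the prior (anchor) sizes are scaled by the predicted w and h. PyTorch is assumed, and the anchor sizes (in pixels) are an assumed input.

```python
import torch

def decode_predictions(pred, anchors, num_classes, img_size=416):
    """pred: (B, A*(5+num_classes), K, K) raw head output for one scale."""
    b, _, k, _ = pred.shape
    a = len(anchors)
    pred = pred.view(b, a, 5 + num_classes, k, k).permute(0, 1, 3, 4, 2)
    grid_y, grid_x = torch.meshgrid(torch.arange(k), torch.arange(k), indexing="ij")
    stride = img_size / k
    cx = (torch.sigmoid(pred[..., 0]) + grid_x) * stride           # x_offset added to the grid point
    cy = (torch.sigmoid(pred[..., 1]) + grid_y) * stride           # y_offset added to the grid point
    anchors = torch.as_tensor(anchors, dtype=pred.dtype)
    pw = torch.exp(pred[..., 2]) * anchors[:, 0].view(1, a, 1, 1)  # prior width combined with w
    ph = torch.exp(pred[..., 3]) * anchors[:, 1].view(1, a, 1, 1)  # prior height combined with h
    conf = torch.sigmoid(pred[..., 4])                             # objectness confidence
    cls = torch.sigmoid(pred[..., 5:])                             # independent logistic classifiers
    return cx, cy, pw, ph, conf, cls
```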
Preferably, the loss function of the model (a YOLOv3-style loss combining localization, confidence and classification terms) is expressed as:

$$
\begin{aligned}
Loss &= \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(x_i-\hat{x}_{ij})^2+(y_i-\hat{y}_{ij})^2\Big]
+\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(\sqrt{\omega_i}-\sqrt{\hat{\omega}_{ij}})^2+(\sqrt{h_i}-\sqrt{\hat{h}_{ij}})^2\Big] \\
&\quad-\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[C_i\log\hat{C}_{ij}+(1-C_i)\log(1-\hat{C}_{ij})\Big]
-\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\Big[C_i\log\hat{C}_{ij}+(1-C_i)\log(1-\hat{C}_{ij})\Big] \\
&\quad-\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\sum_{c\in classes}\Big[P_i(c)\log\hat{P}_{ij}(c)+(1-P_i(c))\log(1-\hat{P}_{ij}(c))\Big]
\end{aligned}
$$
Preferably, the model introduces pre-training weights when classifying the data types: the backbone network parameters are first frozen and the model is trained for 50 iterations, the parameters are then unfrozen and training continues for another 100 iterations, and the weights with the lower classification loss and total loss are taken as the final training result. A sketch of this two-stage schedule is given below.
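The following sketch illustrates the two-stage transfer-learning schedule, assuming the frozen stage targets the pre-trained backbone; `train_one_epoch` is a hypothetical helper, and 50/100 are the iteration counts given in the text.

```python
def two_stage_training(model, train_loader, optimizer, train_one_epoch):
    for p in model.backbone.parameters():      # stage 1: backbone frozen for 50 iterations
        p.requires_grad = False
    for _ in range(50):
        train_one_epoch(model, train_loader, optimizer)
    for p in model.backbone.parameters():      # stage 2: unfreeze and train for 100 more
        p.requires_grad = True
    for _ in range(100):
        train_one_epoch(model, train_loader, optimizer)
    return model
```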
The invention has the beneficial effects that:
according to the invention, the mask wearing condition video detection method based on YOLOv3 is adopted, so that the mask wearing video detection task can be realized more accurately and rapidly. And the proposed network is more concise, so that the application cost is further reduced. By further subclassing the data set and adding an ECANet attention mechanism module in a backbone network, the detection precision of the network is improved; by deleting the network feature layer of the minimum scale in the YOLOv3, the network is more focused on the targets of the large and medium scales, and the network detection speed is further improved.
Drawings
FIG. 1 is a diagram illustrating an example of the division of three major classes of data sets in the present invention;
FIG. 2 is a diagram illustrating the subclassing of images of irregularly worn masks;
FIG. 3 is a diagram of the primary network structure of the original YOLOv3 in the present invention;
FIG. 4 is a network architecture diagram of the ECANet in the present invention;
FIG. 5 is a diagram of the main network structure of the mask wearing video detection task proposed by the present invention;
fig. 6 is a diagram of a video transmission and display device of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A mask wearing condition detection method based on Slim-YOLOv3 comprises the following steps: acquiring face video data in real time, and preprocessing the face video data; and inputting the preprocessed face image into a trained improved Slim-YOLOv3 model, and judging whether the user wears the mask correctly. The improved Slim-YOLOv3 model includes a backbone network Darknet-53, a feature enhancement and prediction network, and a decoding network.
The process of training the improved Slim-YOLOv3 model includes:
s1: acquiring an original data set, and preprocessing the original data set to obtain a training sample set and a test sample set;
s2: carrying out initial classification on data in a training sample set;
s3: inputting the classified images in the training sample set into a backbone network Darknet-53 for multi-scale transformation, and extracting a plurality of scale features;
s4: inputting a plurality of scale features into a feature enhancement and prediction network to obtain a classification prediction result;
s5: inputting the classification prediction result into a decoding network for decoding;
s6: calculating a loss function of the model according to the decoding result;
S7: inputting the data in the test set into the model for prediction, optimizing the loss function of the model according to the prediction results, and finishing the training of the model when the loss function reaches its minimum.
An embodiment of training the improved Slim-YOLOv3 model comprises:
S1: acquiring an original data set and classifying it, the classification results comprising: images with the mask worn correctly, images with the mask worn irregularly, and images without a mask;
s2: dividing the classified data set to obtain a training sample set and a testing sample set; carrying out data enhancement processing on the training sample set;
s3: inputting the images in the enhanced training sample set into a YOLOv3 network model of a backbone network Darknet-53 for multi-scale transformation, and extracting multi-scale classification features and positioning features;
S4: two feature layers need to be output; they are located at different positions of the Darknet53 trunk, namely the lower-middle layer and the bottom layer, with sizes (26, 26, 512) and (13, 13, 1024) respectively, and each then undergoes five convolutions;
S5: after this processing, part of the 13 × 13 feature layer is used to output its corresponding prediction result, and part is upsampled (UpSampling2D) and combined with the 26 × 26 feature layer; the feature maps at both scales then undergo one 3 × 3 and one 1 × 1 convolution;
S6: for a picture initially divided into K × K grid cells, K denotes the number of cells along each side, and the larger K is, the smaller each cell becomes. Two scales are used to predict C classes, and the resulting tensor for each scale is K × K × [2 × (4+1+C)], where 4 stands for the coordinate offsets x_offset and y_offset and the width w and height h of the predicted box, 1 stands for the confidence that the box contains a target, and C is the number of prediction categories. The network uses multiple independent logistic regression classifiers; each classifier only judges whether the object in the target box belongs to its label, i.e. a simple binary classification, thereby realizing multi-label classification.
S7: decoding. The corresponding x_offset and y_offset are added to each grid point, and the result is the centre of the prediction box; the prior box is then combined with h and w to compute the length and width of the prediction box; the localization loss is computed from the position information and the ground-truth annotations, and the classification loss from the predicted and annotated categories. The process is as follows:
1. judging the position of the real frame in the picture, and judging which grid point the real frame belongs to for detection;
2. judging which prior frame has the highest coincidence degree with the real frame;
3. calculating what prediction the grid point should produce in order to recover the real frame;
4. all real frames are processed as above;
5. obtaining the prediction the network should produce and comparing it with the actual prediction.
S8: according to the classification loss and the positioning loss, the training of the model is finished when the loss has converged and no longer decreases, or when a set number of iterations is reached.
As shown in fig. 1, the acquired data are classified into three major categories according to the actual mask wearing condition, namely no mask, mask worn irregularly, and mask worn correctly, with the labels Nomask, Wrmask and Swmask respectively.
Furthermore, because masks can be worn irregularly in many different ways, the irregular-wearing class exhibits large intra-class variation, which lowers the detection precision of this class and of the detector as a whole. The data set of irregularly worn masks is therefore subdivided into four subclasses labeled Notnorm1, Notnorm2, Notnorm3 and Notnorm4.
The process of classifying the irregularly worn mask images with the improved unsupervised image self-classification method SCAN comprises the following steps:
Step 1: extracting the face regions of irregularly worn masks from the mask data set as a training set;
Step 2: extracting high-level semantic features of the images through self-supervision, avoiding the reliance on low-level features found in current end-to-end learning methods;
Step 3: performing classification training on the face-region data of the mask wearing data set with an ECAResnet50 network to obtain pre-training weights;
Step 4: importing the pre-training weights into the network formed by ECAResnet50 and a multilayer perceptron, computing the cosine similarity between the high-level semantic features extracted by the ECAResnet50 network, and grouping images whose semantic features have high similarity as neighbours according to that similarity;
Step 5: performing cluster learning with the nearest neighbours as a prior. A clustering function Φ_η is learned, where η denotes the neural-network weight parameters and Φ_η(X) gives the predicted class probabilities for a sample. For a sample X in the data set D and its neighbour set N_X, pseudo labels are assigned jointly over the C classes, and the probability that sample X is assigned to class c is written Φ_η^c(X). The weight parameters of Φ_η are learned through the objective function Λ below:

$$\Lambda = -\frac{1}{|D|}\sum_{X\in D}\sum_{k\in N_X}\log\left\langle \Phi_\eta(X),\,\Phi_\eta(k)\right\rangle + \lambda\sum_{c\in C}\Phi_\eta'^{\,c}\log\Phi_\eta'^{\,c}, \qquad \Phi_\eta'^{\,c}=\frac{1}{|D|}\sum_{X\in D}\Phi_\eta^{c}(X)$$

where D is the data set, X denotes a sample, Φ_η(X) is the clustering function, η the neural-network weight parameters, ⟨·,·⟩ the dot product, λ the weight of the entropy term, Φ'^c_η the mean probability of samples being assigned to class c, and C the set of classes. To make a sample X_i and its neighbours produce consistent predictions, the dot product is maximal only when the predictions belong to the same class and are close to one-hot. To avoid assigning all samples to a single class, a penalty term is added so that the predictions are distributed uniformly over all classes. In the specific implementation, regarding the value of K in the K nearest neighbours: when K = 0 only the samples and their augmented images are used, and when K > 1 intra-class sample variation is taken into account, but the penalty term is still needed because not all neighbours belong to the same class.
Step 6: fine-tuning through self-labeling. Highly confident predictions (p_max close to 1) are selected by defining a threshold; samples whose confidence exceeds the threshold are given pseudo labels, and the cross-entropy loss on these samples is computed to update the parameters. To avoid overfitting, strongly augmented samples are used as input; samples whose confidence rises above the threshold are continually added to the high-confidence set, the iteration ends after a finite number of rounds, and the classification result yields pseudo labels for the four subclasses. (A sketch of the Step 5 clustering objective is given below.)
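The following sketch illustrates the Step 5 clustering objective reconstructed above: it maximises the dot product between a sample's class probabilities and those of its mined neighbours and adds an entropy term that spreads predictions over all classes. PyTorch is assumed, and the entropy weight `lam` is a free hyper-parameter not specified in the text.

```python
import torch

def scan_loss(probs, neighbor_probs, lam=5.0, eps=1e-8):
    """probs: (N, C) softmax outputs; neighbor_probs: (N, K, C) outputs for mined neighbours."""
    dots = torch.bmm(neighbor_probs, probs.unsqueeze(2)).squeeze(2)   # <Phi(X), Phi(k)>
    consistency_loss = -torch.log(dots + eps).mean()
    mean_probs = probs.mean(dim=0)                                    # Phi'^c over the batch
    entropy_term = (mean_probs * torch.log(mean_probs + eps)).sum()   # penalises collapsing to one class
    return consistency_loss + lam * entropy_term
```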
YOLOv3 performs multi-scale training on full images using the Darknet-53 backbone. The backbone introduces the feature-pyramid idea to extract multi-scale features, and three feature layers of different scales are extracted for the prediction boxes to detect objects of different sizes. The smaller-scale feature layer is upsampled, converted by deconvolution to the same size as the previous feature layer, and then concatenated, so that information from the three feature layers of different scales is obtained.
As shown in fig. 2, the structure of the original YOLOv3 model specifically includes:
1. backbone network Darknet-53. Inputting the images in the enhanced training sample set into a YOLOv3 network model of a backbone network Darknet-53 for multi-scale transformation, and extracting features of multiple scales through the backbone network Darknet-53;
2. Feature enhancement and prediction network. The three feature layers output by Darknet-53 are taken as input; they come from different positions of the Darknet53 trunk, namely the middle, lower-middle and bottom layers, with sizes (52, 52, 256), (26, 26, 512) and (13, 13, 1024) respectively, and each is processed by five convolutions. After this processing, part of each feature layer is used to output the prediction result of that layer, and part is upsampled (UpSampling2D) and combined with the previous layer; each layer of features then undergoes one 3 × 3 and one 1 × 1 convolution. For a picture initially divided into K × K grid cells, K denotes the number of cells along each side, and the larger K is, the smaller each cell becomes. Three scales are used to predict C classes, and the resulting tensor for each scale is K × K × [3 × (4+1+C)], where 4 stands for the coordinate offsets x_offset and y_offset and the width w and height h of the predicted box, 1 stands for the confidence that the box contains a target, and C is the number of prediction categories. The network uses multiple independent logistic regression classifiers; each classifier only judges whether the object in the target box belongs to its label, i.e. a simple binary classification, thereby realizing multi-label classification.
3. Decoding part. The corresponding x_offset and y_offset are added to each grid point, and the result is the centre of the prediction box; the prior box is then combined with h and w to compute the length and width of the prediction box; the localization loss is computed from the position information and the ground-truth annotations, and the classification loss from the predicted and annotated categories. The process is as follows:
1. judging the position of the real frame in the picture, and judging which grid point the real frame belongs to for detection;
2. judging which prior frame has the highest coincidence degree with the real frame;
3. calculating what prediction the grid point should produce in order to recover the real frame;
4. all real frames are processed as above;
5. obtaining the prediction the network should produce and comparing it with the actual prediction.
According to the classification loss and the positioning loss, the training of the model is finished when the loss has converged and no longer decreases, or when a set number of iterations is reached.
A backbone network named Darknet-53 performs multi-scale training on the full image. Darknet-53 comprises five large residual convolution blocks that contain 1, 2, 8, 8 and 4 small residual units respectively. Darknet-53 uses the residual network structure: the residual convolution in Darknet53 first applies a 3×3 convolution with stride 2, stores that convolution layer, then performs a 1×1 convolution followed by a 3×3 convolution and adds the result to the stored layer as the final output. Residual networks are easy to optimize and gain accuracy from considerably increased depth; the skip connections used inside the residual units alleviate the vanishing-gradient problem caused by increasing depth in deep neural networks. Every convolution part of darknet53 uses the dedicated DarknetConv2D structure: L2 regularization is applied at each convolution, and batch normalization and LeakyReLU follow each convolution. An ordinary ReLU sets all negative values to zero, whereas Leaky ReLU gives negative values a small non-zero slope; mathematically it can be expressed as

$$y_i = \begin{cases} x_i, & x_i \ge 0 \\ \dfrac{x_i}{a_i}, & x_i < 0 \end{cases}$$

where x_i is the normalized input, a_i is a fixed scaling value that maps negative inputs to values close to 0, and y_i is the output of the activation function.
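The sketch below shows a residual unit matching this description, assuming PyTorch: a 1×1 then a 3×3 convolution, each followed by batch normalization and LeakyReLU, with the block input added back through a skip connection. L2 regularization would be supplied via the optimiser's weight_decay argument rather than inside the module.

```python
import torch.nn as nn

class DarknetResidual(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.block = nn.Sequential(
            nn.Conv2d(channels, half, 1, bias=False), nn.BatchNorm2d(half), nn.LeakyReLU(0.1),
            nn.Conv2d(half, channels, 3, padding=1, bias=False), nn.BatchNorm2d(channels), nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return x + self.block(x)   # skip connection eases optimisation in deep networks
```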
A backbone network Darknet-53 is utilized to introduce a characteristic pyramid idea to extract multi-scale characteristics; extracting three layers of characteristics with different scales for predicting boxes and detecting objects with different sizes; and (3) upsampling the characteristic layer with a smaller scale, converting the upsampled characteristic layer into the same size as the previous characteristic layer through deconvolution, and then splicing. In this way, information between feature layers of three different scales can be obtained.
As shown in fig. 3, the ECANet module aggregates the convolutional features with global average pooling without dimensionality reduction, determines the convolution kernel size k adaptively, performs a one-dimensional convolution, and then learns the channel attention through a sigmoid function. Because the dependencies between all channels cannot be captured efficiently from the visual channel features, ECANet only considers information exchange between the current channel and its k neighbouring channels. If each of the C channels had its own k weights, the number of parameters would be k × C. The weight of each channel is computed as

$$\omega_i = \sigma\left(\sum_{j=1}^{k} w^j\, y_i^j\right), \qquad y_i^j \in \Omega_i^k$$

where ω_i is the attention weight of the i-th channel, Ω_i^k denotes the set of k neighbouring channels of y_i, σ denotes the activation function, w^j is the weight of the j-th neighbouring channel, y_i^j is the j-th neighbour of the i-th channel feature, and k is the number of neighbouring channels.
The strategy can be simply and quickly realized in a one-dimensional convolution mode, the kernel size is k, and the processing formula is as follows:
ω = σ(C1D_k(y)),
where C1D denotes a one-dimensional convolution; by using this formula, the ECANet module requires only k parameters in total.
A lightweight ECANet attention module is added at the end of each Darknet-53 residual convolution block to obtain the ECA_Darknet-53 backbone, which extracts fine-grained features at two scales. Because only large and medium-scale faces need to be detected in the practical application scenario, only the features of the last two scales of ECA_Darknet-53 are extracted for further processing, while larger objects can be processed further to accomplish new tasks.
As shown in fig. 3, the ECANet attention module takes a feature map of size W × H × C, aggregates it with global average pooling (GAP), and generates channel weights with a fast one-dimensional convolution of size k, where k is determined adaptively as a function of the channel dimension C.
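The following sketch shows the adaptive kernel-size rule used by ECA-Net to derive k from the channel dimension C; gamma = 2 and b = 1 are the defaults from the ECA-Net paper and are assumptions here, not values stated in this document.

```python
import math

def eca_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1     # force an odd kernel size
```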
As shown in fig. 4, the main network structure of the mask wearing video detection task proposed by the invention includes: a data enhancement network for augmenting the training data set; ECADarknet-53, obtained by adding an ECANet attention module at the end of every small residual unit of Darknet-53, the ECANet-augmented backbone better extracting the features most relevant to the task; and a prediction part in which the smallest-scale feature layer of the three extracted by the original YOLOv3 network is deleted, so that the network concentrates on large and medium-scale targets, becomes more compact, and detects faster. As shown by the two feature layers in fig. 4, after the large residual blocks of the backbone, only the features of the last two residual blocks are output; their shapes are (26, 26, 512) and (13, 13, 1024), containing the larger-scale features. The last feature layer undergoes five convolutions, after which one part outputs the prediction result of that layer and the other part is upsampled (UpSampling2D), combined with the preceding feature layer, and passed through another five convolutions to output the corresponding prediction result.
In one embodiment, the labels of the irregularly worn mask data are re-labeled with the improved SCAN unsupervised image self-classification method and subdivided into subclasses to obtain the final training and test data sets; the final detection model is obtained by training the improved YOLOv3-based video detection method with YOLOv3's built-in data enhancement combined with transfer learning. The hardware uses a Hikvision dual-spectrum body-temperature-measurement bullet camera (DS-2TD2637B-10) as the image acquisition device, deployed simply on a tripod, together with a desktop computer equipped with a GeForce GTX 1060Ti graphics card. As shown in fig. 5, video acquisition is equipped with a hard-disk recorder and connected through a switch to realize data transmission.
The expression of the loss function of the model is:

$$
\begin{aligned}
Loss &= \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(x_i-\hat{x}_{ij})^2+(y_i-\hat{y}_{ij})^2\Big]
+\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(\sqrt{\omega_i}-\sqrt{\hat{\omega}_{ij}})^2+(\sqrt{h_i}-\sqrt{\hat{h}_{ij}})^2\Big] \\
&\quad-\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[C_i\log\hat{C}_{ij}+(1-C_i)\log(1-\hat{C}_{ij})\Big]
-\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\Big[C_i\log\hat{C}_{ij}+(1-C_i)\log(1-\hat{C}_{ij})\Big] \\
&\quad-\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\sum_{c\in classes}\Big[P_i(c)\log\hat{P}_{ij}(c)+(1-P_i(c))\log(1-\hat{P}_{ij}(c))\Big]
\end{aligned}
$$

where λ_coord and λ_noobj are the weights of the corresponding terms, S² is the number of grid cells, B is the number of candidate boxes generated per grid cell, 1_ij^obj indicates whether the j-th anchor box of the i-th grid cell is responsible for predicting this object and 1_ij^noobj that it is not, x_i and y_i are the abscissa and ordinate of the actual centre point of the i-th grid cell, x̂_ij and ŷ_ij are the centre-point coordinates predicted and decoded by the j-th anchor box of the i-th grid cell, ω and h are the width and height of the target, ω̂ and ĥ are the decoded width and height, C is the confidence that the target prediction box contains a target object, Ĉ is the decoded confidence, classes denotes all classes of the data set, P is the probability that the target belongs to class c, and P̂ is the predicted probability that the target belongs to class c.
This embodiment achieves fast and accurate recognition of mask wearing condition and body-temperature monitoring for large and medium-scale faces.
The above-mentioned embodiments, which further illustrate the objects, technical solutions and advantages of the present invention, should be understood that the above-mentioned embodiments are only preferred embodiments of the present invention, and should not be construed as limiting the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A mask wearing condition detection method based on Slim-YOLOv3 is characterized by comprising the following steps: acquiring face video data in real time, and preprocessing the face video data; inputting the preprocessed face image into a trained improved Slim-YOLOv3 model, and judging whether the user wears the mask correctly; the improved Slim-YOLOv3 model comprises a backbone network ECADarknet-53, a feature enhancement and prediction network and a decoding network;
the process of training the improved Slim-YOLOv3 model includes:
s1: acquiring an original data set, and preprocessing the original data set to obtain a training sample set and a test sample set;
s2: classifying and re-labeling the data in the training sample set and the test sample set;
s3: inputting the classified training sample set into a backbone network Darknet-53 for multi-scale transformation, and extracting a plurality of scale features;
s4: inputting a plurality of scale features into a feature enhancement and prediction network to obtain a classification prediction result;
s5: inputting the classification prediction result into a decoding network for decoding;
s6: calculating a loss function of the model according to the decoding result;
s7: inputting the data in the test set into the model for prediction, optimizing the loss function of the model according to the prediction result, and finishing the training of the model when the change of the loss function is small or the maximum number of iterations is reached.
2. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 1, wherein preprocessing the raw data set comprises: compressing and turning data in the original data set and changing the brightness of the image to obtain enhanced image data; and segmenting the enhanced image data to obtain a training sample set and a test sample set.
3. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 1, wherein the process of classifying the data in the training sample set and the test sample set comprises: dividing the face mask wearing conditions into three categories according to the images of the original data set, including standard mask wearing images, non-standard mask wearing images and non-mask wearing images; and adopting an improved image unsupervised self-classification method SCAN to reclassify the mask images which are not worn normally, and obtaining a plurality of subclasses.
4. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 3, wherein the process of classifying the irregular wearing mask pattern by using the improved image unsupervised self-classification method SCAN comprises:
step 1: extracting a face area which is not standard for wearing the mask in the mask data set as a training set;
step 2: carrying out classification training on the mask wearing condition data set face region data by adopting an ECAResnet50 network to obtain a pre-training weight;
step 3: importing the pre-training weights into the ECAResnet50-based network and extracting high-level semantic features of the images;
step 4: calculating the cosine similarity between the high-level semantic features and grouping images whose semantic features have high similarity as neighbours;
step 5: performing cluster learning with the nearest neighbours as a prior;
step 6: fine-tuning the clustered images through self-labeling to obtain pseudo labels for four subclasses.
5. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 4, wherein the cosine similarity of the high-level semantic features is calculated as:

$$\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$

wherein x_i and y_i denote the i-th components of the two semantic feature vectors respectively, and n denotes the total dimension of the vectors.
6. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 5, wherein the process of extracting the multi-scale features of the classified images in the training sample set with the backbone network ECADarknet-53 comprises: inputting the image into a data enhancement module and resizing it to 416 × 416 × 3; inputting the adjusted image into the ECADarknet53 network and raising its channel dimension with one convolution block, yielding a feature map of size batch_size × 416 × 416 × 32; extracting features from the up-dimensioned map with five residual convolution blocks that incorporate the attention mechanism ECANet module, the feature scale increasing after each residual convolution block, and finally outputting the two feature layers produced by the fourth and fifth residual convolution blocks; wherein batch_size denotes the number of images fed to the network at a time.
7. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 6, wherein the process of processing features with the attention mechanism ECANet module comprises: performing channel-wise global average pooling on the feature layer without reducing the dimension; selecting the data of k neighbouring channels for each channel, applying a 1 × 1 convolution to the globally average-pooled data, and passing the result through a sigmoid activation function; and expanding the activated data to the size of the input features and multiplying it with the input features to obtain enhanced features containing information from multiple channels.
8. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 1, wherein the process of processing the multiple scale features by using the feature enhancement and prediction network comprises:
step 1: performing five times of convolution processing on the features obtained by a fifth residual convolution block in the ECADarknet53 network;
step 2: performing 3 × 3 convolution once and 1 × 1 convolution once again on the feature subjected to the convolution processing, and taking the processed result as the prediction result of the scale feature layer corresponding to the fifth residual convolution block;
step 3: upsampling (UpSampling2D) the features after the five convolutions, stacking them with the feature layer produced by the fourth residual convolution block, and fusing and enhancing the information of the two scales;
step 4: applying five convolutions to the fused feature map, then one 3 × 3 convolution and one 1 × 1 convolution, to obtain the prediction of the scale feature layer corresponding to the fourth residual convolution block;
step 5: outputting the prediction results of the feature layers at the two scales, where each scale's result contains, for every grid point, the prediction boxes generated from the prior boxes assigned to that scale together with their positions, confidences and classes, the two feature layers dividing the picture into grids of different sizes.
9. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 1, wherein the process of inputting the classification prediction result into a decoding network for decoding comprises:
step 1: adding the corresponding x_offset and y_offset to each grid point to obtain the centre of the prediction frame; wherein x_offset and y_offset denote the offsets of the actual predicted point from the upper-left coordinate (x, y) of the grid cell in the x and y directions, respectively;
step 2: combining the prior frame with h and w, and calculating the length and width of the prediction frame; wherein h and w respectively represent the scaling values of the prediction frame;
step 3: calculating the positioning loss from the position information and the actual labeling information, and calculating the classification loss from the predicted category information and the actual labeled category information;
step 5: judging the position of the real frame in the picture, and judging which grid point the real frame belongs to for detection;
step 6: calculating the coincidence degree of the real frame with each prior frame, and selecting the prior frame with the highest coincidence degree for verification;
step 7: obtaining the prediction the network should produce and comparing it with the actual labeled result.
10. The mask wearing condition detection method based on Slim-YOLOv3 as claimed in claim 1, wherein the loss function of the model has the expression:

$$
\begin{aligned}
Loss &= \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(x_i-\hat{x}_{ij})^2+(y_i-\hat{y}_{ij})^2\Big]
+\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(\sqrt{\omega_i}-\sqrt{\hat{\omega}_{ij}})^2+(\sqrt{h_i}-\sqrt{\hat{h}_{ij}})^2\Big] \\
&\quad-\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[C_i\log\hat{C}_{ij}+(1-C_i)\log(1-\hat{C}_{ij})\Big]
-\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\Big[C_i\log\hat{C}_{ij}+(1-C_i)\log(1-\hat{C}_{ij})\Big] \\
&\quad-\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\sum_{c\in classes}\Big[P_i(c)\log\hat{P}_{ij}(c)+(1-P_i(c))\log(1-\hat{P}_{ij}(c))\Big]
\end{aligned}
$$

wherein λ_coord and λ_noobj are the weights of the corresponding terms, S² is the number of grid cells, B is the number of candidate boxes generated per grid cell, 1_ij^obj indicates whether the j-th anchor box of the i-th grid cell is responsible for predicting this object and 1_ij^noobj that it is not, x_i and y_i are the abscissa and ordinate of the actual centre point of the i-th grid cell, x̂_ij and ŷ_ij are the centre-point coordinates predicted and decoded by the j-th anchor box of the i-th grid cell, ω and h are the width and height of the target, ω̂ and ĥ are the decoded width and height, C is the confidence that the target prediction box contains a target object, Ĉ is the decoded confidence, classes denotes all classes of the data set, P is the probability that the target belongs to class c, and P̂ is the predicted probability that the target belongs to class c.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110330611.2A CN112949572B (en) | 2021-03-26 | 2021-03-26 | Slim-YOLOv3-based mask wearing condition detection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110330611.2A CN112949572B (en) | 2021-03-26 | 2021-03-26 | Slim-YOLOv3-based mask wearing condition detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112949572A true CN112949572A (en) | 2021-06-11 |
CN112949572B CN112949572B (en) | 2022-11-25 |
Family
ID=76227145
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110330611.2A Active CN112949572B (en) | 2021-03-26 | 2021-03-26 | Slim-YOLOv 3-based mask wearing condition detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112949572B (en) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113516194A (en) * | 2021-07-20 | 2021-10-19 | 海南长光卫星信息技术有限公司 | Hyperspectral remote sensing image semi-supervised classification method, device, equipment and storage medium |
CN113553984A (en) * | 2021-08-02 | 2021-10-26 | 中再云图技术有限公司 | Video mask detection method based on context assistance |
CN113553936A (en) * | 2021-07-19 | 2021-10-26 | 河北工程大学 | Mask wearing detection method based on improved YOLOv3 |
CN113762201A (en) * | 2021-09-16 | 2021-12-07 | 深圳大学 | Mask detection method based on yolov4 |
CN113963251A (en) * | 2021-11-26 | 2022-01-21 | 山东省计算中心(国家超级计算济南中心) | Marine organism detection method, system and equipment |
CN113989708A (en) * | 2021-10-27 | 2022-01-28 | 福州大学 | Campus library epidemic prevention and control method based on YOLO v4 |
CN114092998A (en) * | 2021-11-09 | 2022-02-25 | 杭州电子科技大学信息工程学院 | Face recognition detection method for wearing mask based on convolutional neural network |
CN114118061A (en) * | 2021-11-30 | 2022-03-01 | 深圳市北科瑞声科技股份有限公司 | Lightweight intention recognition model training method, device, equipment and storage medium |
CN114155453A (en) * | 2022-02-10 | 2022-03-08 | 深圳爱莫科技有限公司 | Training method for ice chest commodity image recognition, model and occupancy calculation method |
CN114283462A (en) * | 2021-11-08 | 2022-04-05 | 上海应用技术大学 | Mask wearing detection method and system |
CN114821702A (en) * | 2022-03-15 | 2022-07-29 | 电子科技大学 | Thermal infrared face recognition method based on face shielding |
CN114944001A (en) * | 2022-06-07 | 2022-08-26 | 杭州电子科技大学 | Mask monitoring and two-dimensional code recognition system design method based on Maixduino AI K210 development board |
CN116311104A (en) * | 2023-05-15 | 2023-06-23 | 合肥市正茂科技有限公司 | Training method, device, equipment and medium for vehicle refitting recognition model |
CN117975376A (en) * | 2024-04-02 | 2024-05-03 | 湖南大学 | Mine operation safety detection method based on depth grading fusion residual error network |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU2020101210A4 (en) * | 2020-06-30 | 2020-08-06 | Anguraj, Dinesh Kumar Dr | Automated screening system of covid-19 infected persons by measurement of respiratory data through deep facial recognition |
CN111626330A (en) * | 2020-04-23 | 2020-09-04 | 南京邮电大学 | Target detection method and system based on multi-scale characteristic diagram reconstruction and knowledge distillation |
CN111862408A (en) * | 2020-06-16 | 2020-10-30 | 北京华电天仁电力控制技术有限公司 | Intelligent access control method |
CN111881775A (en) * | 2020-07-07 | 2020-11-03 | 烽火通信科技股份有限公司 | Real-time face recognition method and device |
CN112183471A (en) * | 2020-10-28 | 2021-01-05 | 西安交通大学 | Automatic detection method and system for standard wearing of epidemic prevention mask of field personnel |
-
2021
- 2021-03-26 CN CN202110330611.2A patent/CN112949572B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111626330A (en) * | 2020-04-23 | 2020-09-04 | 南京邮电大学 | Target detection method and system based on multi-scale characteristic diagram reconstruction and knowledge distillation |
CN111862408A (en) * | 2020-06-16 | 2020-10-30 | 北京华电天仁电力控制技术有限公司 | Intelligent access control method |
AU2020101210A4 (en) * | 2020-06-30 | 2020-08-06 | Anguraj, Dinesh Kumar Dr | Automated screening system of covid-19 infected persons by measurement of respiratory data through deep facial recognition |
CN111881775A (en) * | 2020-07-07 | 2020-11-03 | 烽火通信科技股份有限公司 | Real-time face recognition method and device |
CN112183471A (en) * | 2020-10-28 | 2021-01-05 | 西安交通大学 | Automatic detection method and system for standard wearing of epidemic prevention mask of field personnel |
Non-Patent Citations (2)
Title |
---|
JIANG XIAOMING ET AL: "YOLOv3_slim for face mask recognition", 《JOURNAL OF PHYSICS: CONFERENCE SERIES》 * |
XIAO JUNJIE: "Face mask detection and standard wearing recognition based on YOLOv3 and YCrCb", 《软件》 (Software) *
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113553936A (en) * | 2021-07-19 | 2021-10-26 | 河北工程大学 | Mask wearing detection method based on improved YOLOv3 |
CN113516194B (en) * | 2021-07-20 | 2023-08-08 | 海南长光卫星信息技术有限公司 | Semi-supervised classification method, device, equipment and storage medium for hyperspectral remote sensing images |
CN113516194A (en) * | 2021-07-20 | 2021-10-19 | 海南长光卫星信息技术有限公司 | Hyperspectral remote sensing image semi-supervised classification method, device, equipment and storage medium |
CN113553984B (en) * | 2021-08-02 | 2023-10-13 | 中再云图技术有限公司 | Video mask detection method based on context assistance |
CN113553984A (en) * | 2021-08-02 | 2021-10-26 | 中再云图技术有限公司 | Video mask detection method based on context assistance |
CN113762201A (en) * | 2021-09-16 | 2021-12-07 | 深圳大学 | Mask detection method based on yolov4 |
CN113762201B (en) * | 2021-09-16 | 2023-05-09 | 深圳大学 | Mask detection method based on yolov4 |
CN113989708A (en) * | 2021-10-27 | 2022-01-28 | 福州大学 | Campus library epidemic prevention and control method based on YOLO v4 |
CN113989708B (en) * | 2021-10-27 | 2024-06-04 | 福州大学 | Campus library epidemic prevention and control method based on YOLO v4 |
CN114283462A (en) * | 2021-11-08 | 2022-04-05 | 上海应用技术大学 | Mask wearing detection method and system |
CN114283462B (en) * | 2021-11-08 | 2024-04-09 | 上海应用技术大学 | Mask wearing detection method and system |
CN114092998A (en) * | 2021-11-09 | 2022-02-25 | 杭州电子科技大学信息工程学院 | Face recognition detection method for wearing mask based on convolutional neural network |
CN113963251A (en) * | 2021-11-26 | 2022-01-21 | 山东省计算中心(国家超级计算济南中心) | Marine organism detection method, system and equipment |
CN114118061A (en) * | 2021-11-30 | 2022-03-01 | 深圳市北科瑞声科技股份有限公司 | Lightweight intention recognition model training method, device, equipment and storage medium |
CN114155453A (en) * | 2022-02-10 | 2022-03-08 | 深圳爱莫科技有限公司 | Training method for ice chest commodity image recognition, model and occupancy calculation method |
CN114821702A (en) * | 2022-03-15 | 2022-07-29 | 电子科技大学 | Thermal infrared face recognition method based on face shielding |
CN114944001A (en) * | 2022-06-07 | 2022-08-26 | 杭州电子科技大学 | Mask monitoring and two-dimensional code recognition system design method based on Maixduino AI K210 development board |
CN116311104B (en) * | 2023-05-15 | 2023-08-22 | 合肥市正茂科技有限公司 | Training method, device, equipment and medium for vehicle refitting recognition model |
CN116311104A (en) * | 2023-05-15 | 2023-06-23 | 合肥市正茂科技有限公司 | Training method, device, equipment and medium for vehicle refitting recognition model |
CN117975376A (en) * | 2024-04-02 | 2024-05-03 | 湖南大学 | Mine operation safety detection method based on depth grading fusion residual error network |
CN117975376B (en) * | 2024-04-02 | 2024-06-07 | 湖南大学 | Mine operation safety detection method based on depth grading fusion residual error network |
Also Published As
Publication number | Publication date |
---|---|
CN112949572B (en) | 2022-11-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112949572B (en) | Slim-YOLOv3-based mask wearing condition detection method | |
CN111639692B (en) | Shadow detection method based on attention mechanism | |
CN110443143B (en) | Multi-branch convolutional neural network fused remote sensing image scene classification method | |
CN112801018B (en) | Cross-scene target automatic identification and tracking method and application | |
CN113361495B (en) | Method, device, equipment and storage medium for calculating similarity of face images | |
US20190236411A1 (en) | Method and system for multi-scale cell image segmentation using multiple parallel convolutional neural networks | |
CN100423020C (en) | Human face identifying method based on structural principal element analysis | |
CN107633226B (en) | Human body motion tracking feature processing method | |
CN113052185A (en) | Small sample target detection method based on fast R-CNN | |
CN110751027B (en) | Pedestrian re-identification method based on deep multi-instance learning | |
US8094971B2 (en) | Method and system for automatically determining the orientation of a digital image | |
CN106599864A (en) | Deep face recognition method based on extreme value theory | |
CN110222636A (en) | The pedestrian's attribute recognition approach inhibited based on background | |
Teimouri et al. | A real-time ball detection approach using convolutional neural networks | |
CN111274964A (en) | Detection method for analyzing water surface pollutants based on visual saliency of unmanned aerial vehicle | |
CN116434010A (en) | Multi-view pedestrian attribute identification method | |
Putro et al. | Fast face-CPU: a real-time fast face detector on CPU using deep learning | |
CN102156879A (en) | Human target matching method based on weighted terrestrial motion distance | |
CN118212572A (en) | Road damage detection method based on improvement YOLOv7 | |
CN111738194A (en) | Evaluation method and device for similarity of face images | |
Gowda | Age estimation by LS-SVM regression on facial images | |
Nanthini et al. | A novel Deep CNN based LDnet model with the combination of 2D and 3D CNN for Face Liveness Detection | |
CN116824330A (en) | Small sample cross-domain target detection method based on deep learning | |
CN112287929A (en) | Remote sensing image significance analysis method based on feature integration deep learning network | |
Lulio et al. | Jseg algorithm and statistical ann image segmentation techniques for natural scenes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||