Face Mask Wearing Detection Algorithm Based on Improved YOLO-v4
Figure 1. Flow chart of the proposed approach.
Figure 2. YOLO-v4 network structure.
Figure 3. CSPDarkNet53 module structure.
Figure 4. CSP1_X module structure.
Figure 5. CSP2_X module structure.
Figure 6. Image scaling in YOLO-v4.
Figure 7. Adaptive image scaling.
Figure 8. Network model of mask detection.
Figure 9. The process of object positioning and prediction.
Figure 10. Division of key parts.
Figure 11. Sample diagram from the data set.
Figure 12. Visualization of different models in performance testing.
Abstract
1. Introduction
- To reduce training time, this paper introduces an improved CSPDarkNet53 into the backbone, which speeds up model convergence and lowers the time cost of training.
- An adaptive image scaling algorithm is introduced to reduce the redundant information processed by the model.
- To strengthen the fusion of multi-scale semantic information, an improved PANet is added to the Neck module.
- The Hard-Swish activation function not only strengthens the nonlinear feature extraction ability of the network but also makes the model's detection results more accurate (see the sketch after this list).
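For reference, Hard-Swish (proposed in MobileNetV3) approximates Swish, x · sigmoid(x), with the cheaper piecewise-linear form x · ReLU6(x + 3)/6. A minimal NumPy sketch of the activation itself, written for illustration rather than taken from the paper's code:

```python
import numpy as np

def relu6(x):
    # ReLU6: clip activations to the range [0, 6]
    return np.minimum(np.maximum(x, 0.0), 6.0)

def hard_swish(x):
    # Hard-Swish (MobileNetV3): x * ReLU6(x + 3) / 6,
    # a piecewise-linear approximation of Swish = x * sigmoid(x)
    return x * relu6(x + 3.0) / 6.0

print(hard_swish(np.array([-4.0, -1.0, 0.0, 1.0, 4.0])))
# -> [-0.     -0.3333  0.      0.6667  4.    ]
```

Unlike ReLU, Hard-Swish is non-monotonic near zero and passes small negative values, which is the nonlinearity gain referred to above, while avoiding the exponential of Swish at inference time.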
2. Related Works
2.1. Problems Exist in Object Detection
2.2. Existing Work
3. The Model Structure of YOLO-v4 Network
4. Improved YOLO-v4 Network Model
4.1. Backbone Feature Extraction Network
4.2. Neck Network
4.3. Adaptive Image Scaling
Algorithm 1 Adaptive image scaling.
Input: w and h are the width and height of the input image;
W and H are the width and height of the object image of standard size.
Begin
  compute the scaling ratios W/w and H/h, and take the smaller one as the uniform scaling ratio r;
  scale the image to w′ = w · r and h′ = h · r;
  if w′ ≠ W or h′ ≠ H: fill the shorter side with the minimal border needed to reach a valid network input size.
End
Output: the scaled and minimally padded image.
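The pseudocode above leaves the exact padding rule implicit. The following Python sketch shows one plausible reading, in the letterbox style this family of detectors commonly uses: scale by the single smaller ratio, then pad only to the next multiple of the network stride rather than all the way to the standard square, so fewer border pixels (redundant information) enter the network. The stride of 32, the gray fill value, and the bicubic interpolation are assumptions for illustration, not details taken from the paper.

```python
import cv2

def adaptive_scale(img, target=(416, 416), stride=32, pad_value=114):
    """Sketch of adaptive image scaling (letterbox-style)."""
    h, w = img.shape[:2]
    W, H = target
    r = min(W / w, H / h)                      # single uniform scaling ratio
    new_w, new_h = int(round(w * r)), int(round(h * r))
    resized = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_CUBIC)

    # Pad only to the next multiple of `stride`, not all the way to target,
    # so the border area added around the image stays minimal.
    pad_w, pad_h = (-new_w) % stride, (-new_h) % stride
    top, bottom = pad_h // 2, pad_h - pad_h // 2
    left, right = pad_w // 2, pad_w - pad_w // 2
    return cv2.copyMakeBorder(resized, top, bottom, left, right,
                              cv2.BORDER_CONSTANT, value=(pad_value,) * 3)
```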
4.4. Improved Network Model Structure
4.5. Object Location and Prediction Process
4.6. The Size Design of Prior Box
5. Experimental Data Set
5.1. Data Set
5.2. Region Division of Real Box
6. Experimental Results and Analysis
6.1. Experimental Platform and Parameters
6.2. The Performance of Different Models in Training
6.3. Comparison of Reasoning Time and Real-Time Performance
6.4. The Parameter Distribution of Different Network Layers
6.5. Model Testing
6.6. Influence of Different Activation Functions
6.7. Analysis of Ablation Experiment
7. Conclusions
- Firstly, the CSP1_X module is introduced into the backbone feature extraction network to enhance feature extraction.
- Secondly, the CSP2_X module is used in the Neck module so that the model can learn deeper semantic information during feature fusion (a structural sketch of the CSP design follows this list).
- Thirdly, the Hard-Swish activation function is used to improve the nonlinear feature learning ability of the model.
- Finally, the proposed adaptive image scaling algorithm reduces the model's reasoning time.
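For orientation, here is a minimal PyTorch sketch of a CSP1_X-style block following the general CSPNet pattern: the features split into two parallel paths, one passing through X residual bottlenecks, and the paths are then concatenated and fused. The layer widths, the Conv-BN-Hard-Swish composition, and the exact arrangement are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv + BatchNorm + Hard-Swish, the basic unit assumed here."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.Hardswish()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Residual bottleneck repeated X times inside CSP1_X."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = ConvBlock(c, c, k=1)
        self.cv2 = ConvBlock(c, c, k=3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))   # shortcut connection

class CSP1_X(nn.Module):
    """CSP1_X-style block: split, process one path, concatenate, fuse."""
    def __init__(self, c_in, c_out, x=1):
        super().__init__()
        c_mid = c_out // 2
        self.dense_path = nn.Sequential(
            ConvBlock(c_in, c_mid, k=1),
            *[Bottleneck(c_mid) for _ in range(x)],
        )
        self.shortcut_path = ConvBlock(c_in, c_mid, k=1)
        self.fuse = ConvBlock(2 * c_mid, c_out, k=1)

    def forward(self, x):
        return self.fuse(torch.cat(
            (self.dense_path(x), self.shortcut_path(x)), dim=1))
```

A CSP2_X-style block differs mainly in that its bottlenecks drop the residual shortcut, which suits the feature-fusion role in the Neck.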
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Feature Map | Receptive Field | Prior Box Size |
---|---|---|
13 × 13 | large object | (221 × 245) (234 × 229) (245 × 251) |
26 × 26 | medium object | (165 × 175) (213 × 222) (217 × 195) |
52 × 52 | small object | (46 × 51) (82 × 100) (106 × 201) |
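The nine prior box sizes above are grouped so that the coarse 13 × 13 feature map handles large objects and the fine 52 × 52 map handles small ones. Anchor sets like these are typically obtained by clustering the ground-truth box sizes; a small sketch of k-means clustering with the customary 1 − IoU distance follows. This illustrates the usual YOLO-family procedure, not necessarily the paper's exact one.

```python
import numpy as np

def iou_wh(boxes, anchors):
    """Pairwise IoU between (w, h) pairs, as if boxes shared a corner."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0])
             * np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
          + (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster (w, h) pairs using the distance d = 1 - IoU."""
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # assign each box to the anchor with the highest IoU (lowest distance)
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = np.median(boxes[assign == j], axis=0)
    return anchors[np.argsort(anchors.prod(axis=1))]  # small -> large
```

Run over the training set's ground-truth boxes, the nine sorted (w, h) pairs are then assigned three per feature map, smallest to the 52 × 52 scale and largest to the 13 × 13 scale.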
Sort | Training Set Images | Training Set Objects | Validation Set Images | Validation Set Objects | Testing Set Images | Testing Set Objects |
---|---|---|---|---|---|---|
face | 2556 | 2670 | 338 | 350 | 721 | 753 |
face_mask | 2685 | 2740 | 219 | 228 | 716 | 730 |
WMI | 2585 | 2604 | 311 | 311 | 724 | 730 |
total | 7826 | 8014 | 868 | 889 | 2161 | 2213 |
Device | Configuration |
---|---|
Operating system | Windows 10 |
Processor | Intel(R) Core i7-9700K |
GPU accelerator | CUDA 10.1, cuDNN 7.6 |
GPU | RTX 2070 Super, 8 GB |
Frameworks | PyTorch, Keras, TensorFlow |
Development tools | PyCharm, Anaconda |
Scripting language | Python 3.7 |
Camera | A4tech USB2.0 Camera |
Hyperparameters | Before Initialization | After Initialization |
---|---|---|
initial learning rate | 0.01000 | 0.00320 |
optimizer weight decay | 0.00050 | 0.00036 |
momentum | 0.93700 | 0.84300 |
classification coefficient | 0.50000 | 0.24300 |
object coefficient | 1.00000 | 0.30100 |
hue | 0.01500 | 0.01380 |
saturation | 0.70000 | 0.66400 |
value | 0.40000 | 0.46400 |
scale | 0.50000 | 0.89800 |
shear | 0.00000 | 0.60200 |
mosaic | 1.00000 | 1.00000 |
mix-up | 0.00000 | 0.24300 |
flip up-down | 0.00000 | 0.00856 |
Model | Parameters (M) | Model Size | Training Time |
---|---|---|---|
Proposed work | 45.2 | 91.0 MB | 2.834 h |
YOLO-v4 | 61.1 | 245 MB | 9.730 h |
YOLO-v3 | 58.7 | 235 MB | 8.050 h |
SSD | 22.9 | 91.7 MB | 3.350 h |
Faster R-CNN | 27.1 | 109 MB | 45.830 h |
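A unit check supports reading the Parameters column as millions of weights rather than megabytes: at 4 bytes per FP32 weight, 58.7 M parameters occupy about 235 MB and 22.9 M about 91.6 MB, matching the YOLO-v3 and SSD model sizes listed above.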
Model | Per-Image Test Time | Total Reasoning Time | FPS |
---|---|---|---|
Proposed work | 0.022 s | 144.7 s | 54.57 |
YOLO-v4 | 0.042 s | 151.1 s | 23.83 |
YOLO-v3 | 0.047 s | 153.1 s | 21.39 |
SSD | 0.029 s | 97.0 s | 34.69 |
Faster R-CNN | 0.410 s | 1620.7 s | 2.44 |
Module | Faster R-CNN | SSD | YOLO-v3 | YOLO-v4 | Proposed Work |
---|---|---|---|---|---|
Backbone | - | - | 40,620,740 | 30,730,448 | 9,840,832 |
Neck | - | - | 14,722,972 | 27,041,012 | 37,514,988 |
Prediction | - | - | 6,243,400 | 6,657,945 | 43,080 |
All parameters | 28,362,685 | 24,013,232 | 61,587,112 | 64,014,760 | 47,398,900 |
All CSPx | - | - | - | 26,816,384 | - |
All CSP1_X | - | - | - | - | 8,288,896 |
All CSP2_X | - | - | - | - | 18,687,744 |
All layers | 185 | 69 | 256 | 370 | 335 |
Model | Sort | Size | Objects | TP | FP | FN | P | R | F1 |
---|---|---|---|---|---|---|---|---|---|
Proposed work | face | 416 × 416 | 753 | 737 | 50 | 16 | 0.936 | 0.979 | 0.957 |
 | face_mask | 416 × 416 | 730 | 725 | 23 | 5 | 0.969 | 0.993 | 0.980 |
 | WMI | 416 × 416 | 730 | 712 | 39 | 18 | 0.948 | 0.975 | 0.961 |
 | Total | 416 × 416 | 2213 | 2174 | 112 | 39 | 0.951 | 0.982 | 0.967 |
YOLO-v4 | face | 416 × 416 | 753 | 666 | 42 | 87 | 0.941 | 0.885 | 0.910 |
 | face_mask | 416 × 416 | 730 | 705 | 199 | 25 | 0.780 | 0.966 | 0.860 |
 | WMI | 416 × 416 | 730 | 670 | 195 | 60 | 0.775 | 0.918 | 0.840 |
 | Total | 416 × 416 | 2213 | 2041 | 436 | 172 | 0.832 | 0.923 | 0.870 |
YOLO-v3 | face | 416 × 416 | 753 | 640 | 53 | 113 | 0.924 | 0.850 | 0.890 |
 | face_mask | 416 × 416 | 730 | 686 | 23 | 44 | 0.968 | 0.940 | 0.950 |
 | WMI | 416 × 416 | 730 | 623 | 26 | 107 | 0.960 | 0.853 | 0.900 |
 | Total | 416 × 416 | 2213 | 1949 | 102 | 264 | 0.950 | 0.881 | 0.913 |
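As a worked check of the first row, the last three columns follow directly from TP = 737, FP = 50, and FN = 16 (the final column, unlabeled in the original layout, is the F1 score):

$$P = \frac{TP}{TP + FP} = \frac{737}{787} \approx 0.936, \qquad R = \frac{TP}{TP + FN} = \frac{737}{753} \approx 0.979, \qquad F_1 = \frac{2PR}{P + R} \approx 0.957.$$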
Model | Size | IoU | Face | Face_Mask | WMI |
---|---|---|---|---|---|
Proposed work | 416 × 416 | mAP@0.5 | 0.979 | 0.995 | 0.973 |
 | 416 × 416 | mAP@0.75 | 0.978 | 0.995 | 0.983 |
 | 416 × 416 | mAP@0.5:0.95 | 0.767 | 0.939 | 0.834 |
YOLO-v4 | 416 × 416 | mAP@0.5 | 0.943 | 0.969 | 0.944 |
 | 416 × 416 | mAP@0.75 | 0.680 | 0.899 | 0.800 |
 | 416 × 416 | mAP@0.5:0.95 | 0.541 | 0.740 | 0.670 |
YOLO-v3 | 416 × 416 | mAP@0.5 | 0.921 | 0.981 | 0.941 |
 | 416 × 416 | mAP@0.75 | 0.617 | 0.888 | 0.835 |
 | 416 × 416 | mAP@0.5:0.95 | 0.559 | 0.789 | 0.724 |
SSD | 300 × 300 | mAP@0.5 | 0.941 | 0.986 | 0.988 |
 | 300 × 300 | mAP@0.75 | 0.503 | 0.920 | 0.926 |
 | 300 × 300 | mAP@0.5:0.95 | 0.518 | 0.789 | 0.790 |
Faster R-CNN | 600 × 600 | mAP@0.5 | 0.943 | 0.974 | 0.950 |
 | 600 × 600 | mAP@0.75 | 0.700 | 0.927 | 0.866 |
 | 600 × 600 | mAP@0.5:0.95 | 0.612 | 0.824 | 0.769 |
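For reference, the COCO-style notation in these tables: mAP@0.5 and mAP@0.75 evaluate average precision at a single IoU threshold, while mAP@0.5:0.95 averages over ten thresholds from 0.5 to 0.95 in steps of 0.05:

$$\text{mAP@0.5:0.95} = \frac{1}{10} \sum_{t \in \{0.50,\, 0.55,\, \ldots,\, 0.95\}} \text{mAP}@t.$$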
Model | mAP@0.5 | mAP@0.75 | mAP@0.5:0.95 |
---|---|---|---|
Proposed work | 0.983 | 0.985 | 0.847 |
YOLO-v4 | 0.952 | 0.793 | 0.680 |
YOLO-v3 | 0.948 | 0.780 | 0.689 |
SSD | 0.972 | 0.783 | 0.691 |
Faster R-CNN | 0.956 | 0.831 | 0.735 |
Function | Train Time | Face | Face_Mask | WMI | mAP@0.5 |
---|---|---|---|---|---|
H-swish | 2.834 h | 0.979 | 0.995 | 0.973 | 0.983 |
Mish | 3.902 h | 0.971 | 0.995 | 0.973 | 0.980 |
L-ReLU | 2.812 h | 0.975 | 0.985 | 0.974 | 0.978 |
ReLU | 3.056 h | 0.970 | 0.972 | 0.969 | 0.970 |
Sigmoid | 2.985 h | 0.966 | 0.968 | 0.963 | 0.966 |
CSP1_X | CSP2_X | H-Swish | Face | Face_Mask | WMI | mAP@0.5 | FPS |
---|---|---|---|---|---|---|---|
× | × | × | 0.943 | 0.969 | 0.944 | 0.952 | 23.83 |
√ | × | × | 0.982 | 0.984 | 0.972 | 0.979 | 43.47 |
× | √ | × | 0.969 | 0.993 | 0.962 | 0.975 | 45.45 |
√ | √ | × | 0.971 | 0.993 | 0.967 | 0.977 | 47.65 |
√ | √ | √ | 0.979 | 0.995 | 0.973 | 0.983 | 54.57 |
Citation: Yu, J.; Zhang, W. Face Mask Wearing Detection Algorithm Based on Improved YOLO-v4. Sensors 2021, 21, 3263. https://doi.org/10.3390/s21093263