MSCD-YOLO: A Lightweight Dense Pedestrian Detection Model with Finer-Grained Feature Information Interaction
Figure 1. The overall network architecture of YOLOv8.
Figure 2. The overall network architecture of MSCD-YOLO.
Figure 3. The architecture of MV2.
Figure 4. The architecture of the MViT Block module.
Figure 5. The architecture of SPD-Conv: (a) SPD-Conv; (b) ReSPD-Conv.
Figure 6. The CGA feature fusion architecture: (a) CGAFusion; (b) CGA module.
Figure 7. The improvement of the detection head: (a) DyHead; (b) DEHead; (c) Conv; (d) Deformable Conv; (e) coordinate attention; (f) Efficient Multi-Scale Attention.
Figure 8. Results of the ablation experiments on the CrowdHuman dataset.
Figure 9. Results of the ablation experiments on the WiderPerson dataset.
Figure 10. Comparison of mAP@0.5, mAP@0.5–0.95, Recall, and Params across different models on the CrowdHuman dataset.
Figure 11. Comparison of mAP@0.5, mAP@0.5–0.95, Recall, and Params across different models on the WiderPerson dataset.
Figure 12. Comparison of feature extraction between different models (street).
Figure 13. Comparison of feature extraction between different models (train).
Figure 14. Comparison of feature extraction between different models (night).
Figure 15. Comparison of feature extraction between different models (mall).
Abstract
1. Introduction
- We introduce the MobileViT backbone network to replace the YOLOv8 backbone. By combining local convolutional and global Transformer-based feature extraction, the backbone captures richer feature information, while MobileViT's lightweight structure reduces the model size (a simplified sketch of the MobileViT block is given after this list).
- We design SC-Neck, a neck network that incorporates SPD-Conv for information-preserving downsampling. We also propose ReSPD-Conv for upsampling feature maps, so that both the auxiliary downsampling and upsampling paths preserve feature information. In addition, CGAFusion is introduced for feature fusion, and a P2 detection layer is added to improve small-object detection (a minimal SPD-Conv sketch also follows this list).
- Finally, we design DEHead as the detection head. It incorporates the EMA attention mechanism to model long-range dependencies, replacing the scale attention in DyHead and removing its task attention, which further lightens the detection head.
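To make the local-plus-global idea behind the backbone concrete, the following is a simplified MobileViT-style block in PyTorch: a convolution builds local features, the feature map is unfolded into patches so a Transformer can exchange information globally across the image, and the result is folded back and fused with the input. This is a minimal sketch for illustration only, not the exact block used in MSCD-YOLO; the layer widths, activation choices, and the `MobileViTBlock` name are assumptions.

```python
import torch
import torch.nn as nn

class MobileViTBlock(nn.Module):
    """Simplified MobileViT-style block: local conv features + global Transformer mixing."""

    def __init__(self, channels: int, dim: int, depth: int = 2, patch: int = 2):
        super().__init__()
        self.patch = patch
        # Local representation: 3x3 conv, then 1x1 projection to the Transformer width.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
            nn.Conv2d(channels, dim, 1, bias=False),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, dim_feedforward=2 * dim, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Conv2d(dim, channels, 1, bias=False)
        # Fuse the original input with the globally mixed features.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # H and W must be divisible by the patch size; dim must be divisible by nhead.
        res = x
        y = self.local(x)                      # (B, dim, H, W)
        b, d, h, w = y.shape
        p = self.patch
        # Unfold: group pixels by their position inside a p x p patch so attention
        # runs across patches (global) rather than within a patch (local).
        y = y.reshape(b, d, h // p, p, w // p, p).permute(0, 3, 5, 2, 4, 1)
        y = y.reshape(b * p * p, (h // p) * (w // p), d)
        y = self.transformer(y)
        # Fold the tokens back onto the original spatial grid.
        y = y.reshape(b, p, p, h // p, w // p, d).permute(0, 5, 3, 1, 4, 2).reshape(b, d, h, w)
        y = self.proj(y)
        return self.fuse(torch.cat([res, y], dim=1))


if __name__ == "__main__":
    block = MobileViTBlock(channels=32, dim=64)
    print(block(torch.randn(1, 32, 80, 80)).shape)  # torch.Size([1, 32, 80, 80])
```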
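Similarly, the SPD-Conv building block used in SC-Neck can be summarized in a few lines: a space-to-depth rearrangement halves the spatial resolution without discarding any pixels, and a stride-1 convolution then mixes the stacked channels. The sketch below is a minimal version that assumes a 3×3 convolution with BatchNorm and SiLU; the ReSPD-Conv upsampling counterpart proposed above is not reproduced here.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth downsampling followed by a non-strided convolution.

    Instead of dropping pixels with a strided convolution or pooling, the
    (B, C, H, W) map is rearranged into (B, 4C, H/2, W/2), so every value
    survives the downsampling step; a stride-1 conv then mixes the channels.
    """

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4 * in_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Stack the four interleaved sub-grids along the channel axis (space-to-depth).
        x = torch.cat(
            [x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]],
            dim=1,
        )
        return self.conv(x)


if __name__ == "__main__":
    spd = SPDConv(64, 128)
    print(spd(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 128, 40, 40])
```

Because no pixels are discarded, fine details of small, distant pedestrians survive the downsampling path of the neck, which is the motivation for preferring this block over strided convolution in dense scenes.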
2. Materials and Methods
2.1. YOLOv8 Algorithm
2.2. MSCD-YOLO Network Model
2.2.1. Feature Extraction Network
2.2.2. Feature Fusion Network
2.2.3. Detection Head
3. Results
3.1. Experimental Environments and Dataset
3.2. Evaluation Indexes
3.3. Experimental Results and Analysis
3.3.1. Ablation Experiments
3.3.2. Comparison and Analysis of Different Models
3.3.3. Visual Analysis of Results
4. Discussion
5. Conclusions
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. Their Appl. 1998, 13, 18–28. [Google Scholar] [CrossRef]
- Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; pp. 886–893. [Google Scholar]
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
- Girshick, R. Fast r-cnn. arXiv 2015, arXiv:1504.08083. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I; pp. 21–37. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Redmon, J. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
- Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. Yolov9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
- Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
- Vaswani, A. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
- Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
- Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919. [Google Scholar]
- Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 568–578. [Google Scholar]
- Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Occlusion-aware R-CNN: Detecting pedestrians in a crowd. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 637–653. [Google Scholar]
- Xu, C.; Wang, G.; Yan, S.; Yu, J.; Zhang, B.; Dai, S.; Li, Y.; Xu, L. Fast vehicle and pedestrian detection using improved Mask R-CNN. Math. Probl. Eng. 2020, 2020, 5761414. [Google Scholar] [CrossRef]
- Zheng, Y.; Izzat, I.H.; Ziaee, S. GFD-SSD: Gated fusion double SSD for multispectral pedestrian detection. arXiv 2019, arXiv:1903.06999. [Google Scholar]
- Zhao, L.; Li, S. Object detection algorithm based on improved YOLOv3. Electronics 2020, 9, 537. [Google Scholar] [CrossRef]
- Liu, H.; Sun, F.; Gu, J.; Deng, L. Sf-yolov5: A lightweight small object detection algorithm based on improved feature fusion mode. Sensors 2022, 22, 5817. [Google Scholar] [CrossRef]
- Li, S.; Wang, S.; Wang, P. A small object detection algorithm for traffic signs based on improved YOLOv7. Sensors 2023, 23, 7145. [Google Scholar] [CrossRef]
- Li, Y.; Fan, Q.; Huang, H.; Han, Z.; Gu, Q. A modified YOLOv8 detection network for UAV aerial image recognition. Drones 2023, 7, 304. [Google Scholar] [CrossRef]
- Lou, H.; Duan, X.; Guo, J.; Liu, H.; Gu, J.; Bi, L.; Chen, H. DC-YOLOv8: Small-size object detection algorithm based on camera sensor. Electronics 2023, 12, 2323. [Google Scholar] [CrossRef]
- Wang, B.; Li, Y.Y.; Xu, W.; Wang, H.; Hu, L. Vehicle–Pedestrian Detection Method Based on Improved YOLOv8. Electronics 2024, 13, 2149. [Google Scholar] [CrossRef]
- Zhang, R.; Xu, L.; Yu, Z.; Shi, Y.; Mu, C.; Xu, M. Deep-IRTarget: An automatic target detector in infrared imagery using dual-domain feature extraction and allocation. IEEE Trans. Multimed. 2021, 24, 1735–1749. [Google Scholar] [CrossRef]
- Zhang, R.; Liu, G.; Zhang, Q.; Lu, X.; Dian, R.; Yang, Y.; Xu, L. Detail-Aware Network for Infrared Image Enhancement. IEEE Trans. Geosci. Remote Sens. 2024, 63, 5000314. [Google Scholar] [CrossRef]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- Li, H.; Xiong, P.; An, J.; Wang, L. Pyramid attention network for semantic segmentation. arXiv 2018, arXiv:1805.10180. [Google Scholar] [CrossRef]
- Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
- Sunkara, R.; Luo, T. No more strided convolutions or pooling: A new CNN building block for low-resolution images and small objects. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Grenoble, France, 19–23 September 2022; pp. 443–459. [Google Scholar]
- Chen, Z.; He, Z.; Lu, Z.M. DEA-Net: Single image dehazing based on detail-enhanced convolution and content-guided attention. IEEE Trans. Image Process. 2024, 33, 1002–1015. [Google Scholar] [CrossRef] [PubMed]
- Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7373–7382. [Google Scholar]
- Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
- Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
- Zhang, S.; Benenson, R.; Schiele, B. Citypersons: A diverse dataset for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3213–3221. [Google Scholar]
- Shao, S.; Zhao, Z.; Li, B.; Xiao, T.; Yu, G.; Zhang, X.; Sun, J. Crowdhuman: A benchmark for detecting human in a crowd. arXiv 2018, arXiv:1805.00123. [Google Scholar] [CrossRef]
- Zhang, S.; Xie, Y.; Wan, J.; Xia, H.; Li, S.Z.; Guo, G. Widerperson: A diverse dataset for dense pedestrian detection in the wild. IEEE Trans. Multimed. 2019, 22, 380–393. [Google Scholar] [CrossRef]
| Environment | Configuration |
|---|---|
| Operating System | Ubuntu 20.04 |
| GPU | NVIDIA GeForce RTX 3090 (24 GB) |
| CPU | Intel Xeon Platinum 8362 |
| Python | 3.8.19 |
| Deep Learning Framework | torch 1.13.1 + cu117 |
| Optimizer | SGD |
| Dataset | Number of Images | Total Target Instances | Instances per Image |
|---|---|---|---|
| Caltech-USA | 42,782 | 13,674 | 0.32 |
| KITTI | 3712 | 2322 | 0.63 |
| COCOPerson | 64,115 | 257,252 | 4.01 |
| CityPersons | 2975 | 19,238 | 6.47 |
| CrowdHuman | 15,000 | 470,000+ | 22.63 |
| WiderPerson | 8000 | 240,000 | 26.51 |
| Baseline | MViT | MViT (P2) | SCNeck | DyHead | DEHead | Param (M) | R (%) | mAP50 (%) | mAP50-95 (%) |
|---|---|---|---|---|---|---|---|---|---|
| YOLOv8n |  |  |  |  |  | 3.0 | 65.9 | 75.8 | 48.0 |
| YOLOv8n_MViT | √ |  |  |  |  | 1.18 | 64.5 | 74.7 | 46.4 |
| YOLOv8n_MViT_P2 |  | √ |  |  |  | 1.28 | 66.7 | 77.1 | 48.7 |
| YOLOv8n_MViT_SCNeck |  | √ | √ |  |  | 1.97 | 68.8 | 79.3 | 51.5 |
| YOLOv8n_MViT_DyHead |  | √ |  | √ |  | 1.37 | 68.2 | 78.5 | 50.7 |
| YOLOv8n_MViT_DEHead |  | √ |  |  | √ | 1.35 | 69.1 | 79.2 | 51.5 |
| MSCD-YOLO |  | √ | √ |  | √ | 2.03 (−0.97) | 70.4 (+4.5) | 80.4 (+4.6) | 53.3 (+5.3) |
| Baseline | MViT | MViT (P2) | SCNeck | DyHead | DEHead | Param (M) | R (%) | mAP50 (%) | mAP50-95 (%) |
|---|---|---|---|---|---|---|---|---|---|
| YOLOv8n |  |  |  |  |  | 3.0 | 79.6 | 88.3 | 62.3 |
| YOLOv8n_MViT | √ |  |  |  |  | 1.18 | 78.4 | 88.0 | 61.6 |
| YOLOv8n_MViT_P2 |  | √ |  |  |  | 1.28 | 80.4 | 89.1 | 62.9 |
| YOLOv8n_MViT_SCNeck |  | √ | √ |  |  | 1.97 | 81.2 | 89.8 | 64.3 |
| YOLOv8n_MViT_DyHead |  | √ |  | √ |  | 1.37 | 80.3 | 89.5 | 63.8 |
| YOLOv8n_MViT_DEHead |  | √ |  |  | √ | 1.35 | 81.2 | 89.8 | 64.4 |
| MSCD-YOLO |  | √ | √ |  | √ | 2.03 (−0.97) | 81.5 (+1.9) | 90.1 (+1.8) | 64.9 (+2.6) |
| Method | Param (M) | R (%) | mAP50 (%) | mAP50-95 (%) |
|---|---|---|---|---|
| Faster-RCNN | 41.34 | – | 78.0 | 49.9 |
| Mask-RCNN | 43.99 | – | 77.0 | 47.0 |
| SSD | 23.746 | – | 69.6 | 34.7 |
| YOLOv5n | 1.76 | 60.8 | 70.8 | 40.4 |
| YOLOv5s | 7.02 | 67.0 | 77.2 | 47.2 |
| YOLOv7-Tiny | 6.01 | 69.41 | 78.56 | 46.22 |
| YOLOv8n (Baseline) | 3.0 | 65.9 | 75.8 | 48.0 |
| YOLOv8s | 11.1 | 70.8 | 80.1 | 53.2 |
| YOLOv9-Tiny | 2.61 | 65.2 | 75.3 | 47.9 |
| YOLOv10n | 2.7 | 64.3 | 74.8 | 46.9 |
| YOLOv10s | 8.03 | 69.6 | 79.5 | 52.3 |
| YOLOv11n | 2.59 | 64.9 | 75.4 | 47.4 |
| YOLOv11s | 9.41 | 70.4 | 79.8 | 52.8 |
| MSCD-YOLO (Ours) | 2.03 | 70.6 | 80.4 | 53.3 |
| Method | Param (M) | R (%) | mAP50 (%) | mAP50-95 (%) |
|---|---|---|---|---|
| Faster-RCNN | 41.34 | – | 86.9 | 58.9 |
| Mask-RCNN | 43.99 | – | 86.9 | 59.0 |
| SSD | 23.746 | – | 77.9 | 43.7 |
| YOLOv5n | 1.76 | 76.3 | 86.8 | 57.7 |
| YOLOv5s | 7.02 | 77.3 | 88.2 | 60.5 |
| YOLOv7-Tiny | 6.01 | 80.5 | 89.0 | 59.5 |
| YOLOv8n (Baseline) | 3.0 | 79.6 | 88.3 | 62.3 |
| YOLOv8s | 11.1 | 82.2 | 90.1 | 64.6 |
| YOLOv9-Tiny | 2.61 | 81.9 | 88.6 | 62.8 |
| YOLOv10n | 2.7 | 78.5 | 87.6 | 61.6 |
| YOLOv10s | 8.03 | 80.4 | 89.7 | 64.0 |
| YOLOv11n | 2.59 | 79.3 | 88.2 | 62.1 |
| YOLOv11s | 9.41 | 81.3 | 89.9 | 64.4 |
| MSCD-YOLO (Ours) | 2.03 | 81.5 | 90.1 | 64.9 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).