Article

MSCD-YOLO: A Lightweight Dense Pedestrian Detection Model with Finer-Grained Feature Information Interaction

by Qiang Liu, Zhongmin Li *, Lei Zhang and Jin Deng
School of Information Engineering, Nanchang Hangkong University, Nanchang 330063, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(2), 438; https://doi.org/10.3390/s25020438
Submission received: 17 December 2024 / Revised: 10 January 2025 / Accepted: 11 January 2025 / Published: 13 January 2025
(This article belongs to the Section Intelligent Sensors)

Abstract

Pedestrian detection is widely used in real-time surveillance, urban traffic, and other fields. As a crucial direction in pedestrian detection, dense pedestrian detection still faces many unresolved challenges. Existing methods suffer from low detection accuracy, high miss rates, large model parameters, and poor robustness. In this paper, to address these issues, we propose a lightweight dense pedestrian detection model with finer-grained feature information interaction called MSCD-YOLO, which can achieve high accuracy, high performance, and robustness with only a small number of parameters. In our model, the lightweight backbone network MobileViT is used to reduce the number of parameters while efficiently extracting both local and global features; the SCNeck neck network is designed to fuse the extracted features without losing information; and the DEHead detection head is utilized for multi-scale feature fusion to detect the targets. To demonstrate the effectiveness of our model, we conducted tests on the highly challenging dense pedestrian detection datasets Crowdhuman and Widerperson. Compared to the baseline model YOLOv8n, MSCD-YOLO achieved a 4.6% and 1.8% improvement in mAP@0.5, and a 5.3% and 2.6% improvement in mAP@0.5:0.95 on the Crowdhuman and Widerperson datasets, respectively. The experimental results show that under the same experimental conditions, MSCD-YOLO significantly outperforms the original model in terms of detection accuracy, efficiency, and model complexity.

1. Introduction

Object detection is an important research branch in computer vision, widely applied in various fields such as urban traffic, autonomous driving, and surveillance. However, in its downstream task of dense pedestrian detection, common challenges such as small and densely packed targets with occlusion issues make this task particularly difficult within the broader object detection domain. Traditional object detection algorithms mainly rely on a “handcrafted feature + classification” framework, such as Support Vector Machines (SVM) [1], Histograms of Oriented Gradients (HOG) [2], and Scale-Invariant Feature Transform (SIFT) [3]. However, these methods are heavily feature-dependent, making it difficult to handle multi-scale and pose variations, and they lack end-to-end learning capabilities. In dense crowd scenes, these methods suffer from low accuracy, missed detections, and false detections. Over time, with the continuous development of deep learning, more and more deep learning methods have been applied in object detection. Convolutional neural networks (CNNs) are frequently used in object detection, such as in two-stage algorithms based on region proposals, like the Region-based Convolutional Neural Network (RCNN) series [4,5,6,7], as well as single-stage regression-based algorithms like Single Shot MultiBox Detector (SSD) [8] and the You Only Look Once (YOLO) series [9,10,11,12,13], which have achieved remarkable results. With the emergence of Transformer-based models [14,15,16,17,18], Transformers have been performing increasingly well in object detection tasks, gradually rivaling convolutional neural networks. Whether it is Vision Transformer, Swin Transformer, or their various later variants [19,20,21], they are continuously pushing the benchmarks in object detection tasks.
The two-stage RCNN series has been highly successful in pedestrian detection tasks. Zhang et al. [22] improved detection accuracy by designing a new Aggregation Loss function (AggLoss) and a Partial Occlusion-aware pooling unit (PORoI) based on RCNN. However, this also increased the computational requirements. To achieve faster and more accurate detection, Xu et al. [23] proposed a lightweight model by replacing the backbone network and designing SF-FPN based on Mask-RCNN, improving both model efficiency and accuracy. Zheng et al. [24] enhanced the one-stage SSD algorithm by designing a gated fusion unit (GRU). Experiments showed that the improved GFD-SSD achieved better accuracy compared to Faster-RCNN, but it still could not perfectly balance accuracy and model complexity. Since its introduction, the YOLO model has been widely applied in object detection due to its excellent real-time performance. Zhao et al. [25] improved YOLOv3’s K-means clustering algorithm for better boundary box prediction, achieving improvements in recall, average precision, and F1 score compared to the original algorithm. Wang et al. [26] introduced PB-FPN to improve the neck network of YOLOv5, allowing the network to retain more information during feature fusion and enhancing the representation of small objects. Li et al. [27] improved YOLOv7 by introducing omnidimensional dynamic convolution (ODConv) and a mixed attention mechanism (ACmix), making the model more sensitive to small object detection. Li et al. [28] tackled the problem of false positives and missed detections in UAV detection of small objects by introducing BiFPN, a feature fusion method that models short- and long-range dependencies by adding skip connections between FPN and PAN, reducing information loss and retaining much of the feature information. Lou et al. [29] proposed DC-YOLO, which improved the convolutional architecture of convolutional neural networks by integrating downsampling at different scales, preserving more feature information, and using skip connections to integrate shallow features into deeper layers, improving YOLO’s performance. Wang et al. [30] introduced the BiFormer attention mechanism to more effectively capture relationships between features, and replaced the original loss function with the WIoUv3 loss function, which emphasizes the IoU of small objects, thus improving the model’s performance in small object detection. Currently, many works also focus on enhancing feature representation. Zhang et al. [31] effectively enhanced the feature representation of infrared images by combining frequency domain information feature extraction branches with spatial domain information; additionally, they designed Channel SE-Attention and Position SE-Attention to achieve feature responses between channel information and position information. Zhang et al. [32] designed the Detail Awareness Unit (DAU) and the Contrast Improvement Module (CIM). The DAU integrates multi-scale information, while the CIM is responsible for reconstructing and enhancing details to achieve more powerful feature representation.
Due to the challenges in dense pedestrian detection tasks, such as complex backgrounds, high occlusion rates, small detection targets, and varying scales, these issues significantly affect detection accuracy. To address these problems, current research directions mainly involve designing more complex model architectures, proposing new attention mechanisms or loss functions, and innovatively designing new network modules to adapt to these tasks. Therefore, we propose MSCD-YOLO to tackle the various challenges in pedestrian detection. The main contributions of this paper are as follows:
  • We chose to introduce the MobileViT backbone network to replace the backbone network of YOLOv8. By combining local and global feature extraction methods, the extracted features contain more information, while MobileViT’s lightweight structure achieves a reduction in model weight.
  • We designed SC-Neck, where the neck network incorporates SPD-Conv for information-preserving downsampling. We also proposed the ReSPD-Conv for upsampling feature maps, achieving an information-preserving neck network through auxiliary upsampling and downsampling. Additionally, CGAFusion is introduced for feature fusion, and the P2 layer is included to improve small object detection performance.
  • Finally, we introduced DEHead as the detection head. By incorporating the EMA attention mechanism to model long-range dependencies, replacing the scale attention mechanism in DyHead, and removing the task attention mechanism, we further lightened the detection head.

2. Materials and Methods

2.1. YOLOv8 Algorithm

The YOLOv8 model was proposed by Ultralytics in 2023. Its main architecture is similar to that of YOLOv5, consisting of three parts: the backbone, neck, and head. The structure of the YOLOv8 model is shown in Figure 1. By adjusting the depth and width, five model scales—n, s, m, l, and x—are derived, with each subsequent model being larger. In terms of details, the backbone primarily uses convolution for feature extraction. YOLOv8 replaces YOLOv5’s C3 module and SPP module with C2f and SPPF, respectively, which offer richer gradient flows. The structure of the V8 neck network is fundamentally the same as that of V5, employing an FPN [33] + PAN [34] structure, while replacing the C3 module with the C2f module. Finally, the detection head of YOLOv5 is replaced with a decoupled head, separating the classification and regression tasks, so they no longer share parameters. In this way, it prevents interference between the two tasks, allowing each to achieve better results. Additionally, YOLOv8 adopts a Task Alignment Learning (TAL) dynamic matching strategy instead of the previous IoU matching strategy, enhancing its ability to distinguish between positive and negative samples. Furthermore, it transitions from an anchor-based approach in YOLOv5 to an anchor-free approach, reducing the model’s dependency on prior boxes and improving its learning capability.
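Since the C2f module recurs throughout the YOLOv8 backbone and neck, a simplified PyTorch sketch of its split-and-concatenate design is given below. The ConvBNSiLU helper, expansion ratio, and bottleneck depth are illustrative simplifications, not the exact Ultralytics implementation.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Convolution followed by BatchNorm and SiLU (illustrative helper)."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Two 3x3 convolutions with a residual connection."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, 3)
        self.cv2 = ConvBNSiLU(c, c, 3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class C2fSketch(nn.Module):
    """Split the features, pass one half through n bottlenecks, and concatenate
    every intermediate output so gradients reach all branches directly."""
    def __init__(self, c_in, c_out, n=2, e=0.5):
        super().__init__()
        self.c = int(c_out * e)
        self.cv1 = ConvBNSiLU(c_in, 2 * self.c, 1)
        self.cv2 = ConvBNSiLU((2 + n) * self.c, c_out, 1)
        self.m = nn.ModuleList(Bottleneck(self.c) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))
        for m in self.m:
            y.append(m(y[-1]))
        return self.cv2(torch.cat(y, dim=1))

# Example: a 64-channel feature map passes through; the channel count is preserved.
out = C2fSketch(64, 64)(torch.randn(1, 64, 80, 80))   # -> torch.Size([1, 64, 80, 80])
```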

2.2. MSCD-YOLO Network Model

To better achieve dense pedestrian detection, we focused on common challenges in this task, such as occlusion, small target size, high miss rates, large scale variations, and complex scenes. By conducting an in-depth analysis and research on solutions to these issues, we propose our MSCD-YOLO model, optimizing dense pedestrian detection tasks. The MSCD-YOLO model is based on the YOLOv8n model, and improvements have been made to address various issues specifically for dense pedestrian detection tasks. The structure of the MSCD-YOLO model is shown in Figure 2 below. We replaced the backbone network of YOLOv8 with the lighter MViT, significantly reducing model complexity and allowing for better integration of local and global feature information. The feature fusion network plays a crucial role in the model; the YOLOv8 model uses traditional nearest neighbor sampling or bilinear interpolation for upsampling and strided convolution for downsampling. This method of feature fusion can easily lead to the loss of small target information, and a small receptive field is not suitable for scenarios involving multiple targets, small targets, or occlusion. We designed the SC-Neck to achieve lossless upsampling, downsampling, and richer feature fusion. Additionally, we improved DyHead to DEHead to enhance the model’s detection capability in dense pedestrian detection scenarios, thereby improving the model’s accuracy and stability.

2.2.1. Feature Extraction Network

The backbone network of YOLOv8 is responsible for feature extraction from the input image, consisting of convolution modules and C2f modules. The successive convolution modules perform local feature mapping on the input features, with different convolution kernels representing different receptive field sizes. The C2f module extracts richer features through gradient flow by using multiple consecutive bottlenecks. However, this local feature extraction does not fully capture the information represented by the features. In contrast, Transformer-based models perform global feature mapping on the input features. By applying self-attention operations between each patch and other patches, global features can be effectively captured. However, this approach significantly increases the model’s parameters, greatly reducing the efficiency of the feature extraction network. In 2021, Apple introduced its MobileViT model [35], which effectively combines CNN and Transformer, addressing both the local feature limitations of CNNs and the large parameter size and strict training requirements of Transformers. In this paper, we replaced the backbone network of the YOLOv8 model with MViT to optimize it.
In MobileViT, the MV2 module is responsible for local mapping with different receptive fields, while the MViT module handles global feature mapping. The MV2 module originates from Apple’s MobileNetV2 [36]. Compared to traditional residual convolution structures, it uses an inverted residual structure: whereas a typical residual bottleneck first shrinks the feature channels, processes the features, and then expands them again, the inverted residual expands the channels before projecting them back down. Because depthwise separable convolution (DWConv) is far lighter than standard convolution, this expansion does not increase the number of parameters. Operating on the expanded channels also allows information and gradients to propagate better through the block during feature learning. The structure of the MV2 module is shown in Figure 3 below.
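To make the inverted residual idea concrete, here is a minimal PyTorch sketch of a MobileNetV2-style block. The expansion factor and the SiLU activation are assumptions for illustration; MobileNetV2 itself uses ReLU6 and a linear projection.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2-style inverted residual: 1x1 expand -> 3x3 depthwise -> 1x1 project.
    The depthwise convolution keeps the expanded stage cheap; the residual is used
    only when the block changes neither resolution nor channel count."""
    def __init__(self, c_in, c_out, stride=1, expand=4):
        super().__init__()
        hidden = c_in * expand
        self.use_res = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            nn.Conv2d(c_in, hidden, 1, bias=False),            # expand channels
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, stride, 1,
                      groups=hidden, bias=False),               # depthwise 3x3
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, c_out, 1, bias=False),            # linear projection
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_res else y

x = torch.randn(1, 32, 64, 64)
print(InvertedResidual(32, 32)(x).shape)            # torch.Size([1, 32, 64, 64])
print(InvertedResidual(32, 64, stride=2)(x).shape)  # torch.Size([1, 64, 32, 32])
```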
The drawback of ViT is that it focuses solely on global features while neglecting the importance of local features, and the significant increase in the number of parameters leads to harsh training conditions. The MViT module addresses these issues by effectively integrating local and global features while using only a small number of parameters.
For the input feature $X \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ represent the height and width of the feature and $C$ the number of channels, we first use an $n \times n$ convolution (usually $3 \times 3$) for local feature extraction. Then, we use a $1 \times 1$ convolution to map it to a higher dimension, $X_1 \in \mathbb{R}^{H \times W \times D}$. Next, we unfold it into $X_D \in \mathbb{R}^{N \times P \times D}$, where $P = h \times w$ is the size of each patch and $N = \frac{HW}{P}$ is the number of patches, and perform lightweight self-attention for global feature extraction. In the illustrated example, the transformer is applied only among pixels that occupy the same position within their respective patches (the pixels of the same color). It is worth noting that, despite this simplification, the initial $n \times n$ convolution already gives each pixel an $n \times n$ receptive field, so the effective receptive field of the self-attention computation is still $H \times W$. Compared to the original ViT, this approach reduces redundant computation while maintaining effectiveness, especially when adjacent pixels carry similar information. Finally, we fold the result back, adjust it with another $1 \times 1$ convolution, and concatenate it with the original features to prevent gradient vanishing and ease training; a final $n \times n$ convolution then produces the output features. The detailed structure of the MViT Block module is shown in Figure 4.
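The unfold–attend–fold procedure can be sketched as follows; the patch size, embedding dimension, and transformer depth below are placeholder values, and the reshapes merely illustrate how pixels at the same position across patches are grouped into sequences.

```python
import torch
import torch.nn as nn

class MViTBlockSketch(nn.Module):
    """Local 3x3 conv -> 1x1 to dim d -> unfold into per-position sequences ->
    transformer -> fold back -> 1x1 back to C -> concat with input -> final 3x3 conv."""
    def __init__(self, c, d=96, patch=(2, 2), n_layers=2, heads=4):
        super().__init__()
        self.ph, self.pw = patch
        self.local = nn.Conv2d(c, c, 3, padding=1)
        self.to_d = nn.Conv2d(c, d, 1)
        layer = nn.TransformerEncoderLayer(d, heads, dim_feedforward=2 * d,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.to_c = nn.Conv2d(d, c, 1)
        self.fuse = nn.Conv2d(2 * c, c, 3, padding=1)

    def forward(self, x):
        B, C, H, W = x.shape
        y = self.to_d(self.local(x))                         # B, d, H, W
        d = y.shape[1]
        nh, nw = H // self.ph, W // self.pw                  # patches per axis
        # Unfold: group pixels that share the same position inside their patch.
        y = y.reshape(B, d, nh, self.ph, nw, self.pw)
        y = y.permute(0, 3, 5, 2, 4, 1).reshape(B * self.ph * self.pw, nh * nw, d)
        y = self.transformer(y)                              # attention across patches
        # Fold back to the spatial layout.
        y = y.reshape(B, self.ph, self.pw, nh, nw, d)
        y = y.permute(0, 5, 3, 1, 4, 2).reshape(B, d, H, W)
        y = self.to_c(y)
        return self.fuse(torch.cat([x, y], dim=1))           # concat with the input

out = MViTBlockSketch(64)(torch.randn(1, 64, 32, 32))        # torch.Size([1, 64, 32, 32])
```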

2.2.2. Feature Fusion Network

In the YOLOv8 model, feature fusion still employs FPN combined with PAN while utilizing the C2f module to bring in more gradient flow. This top-down and then bottom-up operation effectively merges features from different levels and enhances the flow of information from shallow to deep layers. However, in YOLOv8, the upsampling uses nearest-neighbor interpolation and the downsampling employs a 3 × 3 convolution with a stride of 2. These operations can easily lead to information loss, which is detrimental to subsequent feature fusion.
To address this issue, we focused on SPD-Conv [37], a type of convolution that does not use strides. This non-strided operation helps reduce feature loss during the feature fusion phase. Based on this, we proposed the inverse transformation of SPD-Conv, called ReSPD-Conv, to be used for upsampling. Thus, we achieved non-strided upsampling and downsampling in both FPN and PAN. In the neck network, we designed our SC-Neck using this type of convolution. By adding auxiliary branches alongside the upsampling and downsampling processes in YOLOv8, we ensured that no information is lost. Additionally, we introduced a small target detection layer, P2, and performed CGAFusion for feature integration. This approach allows us to achieve a neck network for feature fusion without losing information. The operational flow of SPD-Conv is shown below.
$$
\begin{aligned}
f_{0,0} &= X[0:S:scale,\ 0:S:scale],\quad f_{1,0} = X[1:S:scale,\ 0:S:scale],\ \ldots,\ f_{scale-1,0} = X[scale-1:S:scale,\ 0:S:scale];\\
f_{0,1} &= X[0:S:scale,\ 1:S:scale],\quad f_{1,1} = X[1:S:scale,\ 1:S:scale],\ \ldots,\ f_{scale-1,1} = X[scale-1:S:scale,\ 1:S:scale];\\
&\ \ \vdots\\
f_{0,scale-1} &= X[0:S:scale,\ scale-1:S:scale],\ \ldots,\ f_{scale-1,scale-1} = X[scale-1:S:scale,\ scale-1:S:scale]
\end{aligned}
$$
The structure of SPD is shown in Figure 5a, and the structure of ReSPD is shown in Figure 5b. Given an input feature $X \in \mathbb{R}^{H \times W \times C_1}$, we slice it and concatenate the slices to form $X_1 \in \mathbb{R}^{H/2 \times W/2 \times 4C_1}$. Next, a $1 \times 1$ convolution adjusts it to the output feature $X_{out} \in \mathbb{R}^{H/2 \times W/2 \times C_2}$, eliminating the need for strided convolution operations.
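Below is a minimal sketch of SPD-style downsampling (space-to-depth followed by a non-strided 1 × 1 convolution) together with a possible ReSPD-style inverse. Interpreting ReSPD as a 1 × 1 convolution followed by a depth-to-space rearrangement (PixelShuffle) is our assumption, since the paper only describes it as the inverse transformation of SPD-Conv.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth: slice the feature map by a factor of `scale` in both axes,
    concatenate the slices along the channel axis, then mix with a 1x1 conv.
    No stride is used, so no pixel is discarded."""
    def __init__(self, c_in, c_out, scale=2):
        super().__init__()
        self.scale = scale
        self.conv = nn.Conv2d(c_in * scale * scale, c_out, 1)

    def forward(self, x):
        s = self.scale
        slices = [x[..., i::s, j::s] for i in range(s) for j in range(s)]
        return self.conv(torch.cat(slices, dim=1))       # (B, c_out, H/s, W/s)

class ReSPDConv(nn.Module):
    """Assumed inverse: a 1x1 conv expands the channels, then a depth-to-space
    rearrangement (PixelShuffle) redistributes them spatially for lossless upsampling."""
    def __init__(self, c_in, c_out, scale=2):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out * scale * scale, 1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):
        return self.shuffle(self.conv(x))                # (B, c_out, H*s, W*s)

x = torch.randn(1, 64, 80, 80)
print(SPDConv(64, 128)(x).shape)                     # torch.Size([1, 128, 40, 40])
print(ReSPDConv(128, 64)(SPDConv(64, 128)(x)).shape) # torch.Size([1, 64, 80, 80])
```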
Since we have constructed an auxiliary feature fusion branch alongside the original YOLOv8 neck, we need to merge the features of the original path with those extracted by the auxiliary branch. We achieve this by introducing the CGAFusion module [38] for feature fusion, which performs a weighted combination of two input features $F_1, F_2 \in \mathbb{R}^{H \times W \times C}$ using the weights obtained from the CGA module. The formula for CGAFusion is shown in Equation (1), and its structure is illustrated in Figure 6a.
$F_{fuse} = C_{1 \times 1}\left(F_1 \cdot W + F_2 \cdot (1 - W) + F_1 + F_2\right)$  (1)
The CGA module takes the input feature $X \in \mathbb{R}^{H \times W \times C}$ and derives joint spatial–channel weights from it. Spatial attention (SA) is applied to obtain $W_s$ and channel attention (CA) to obtain $W_c$; these are combined into a coarse weight $W_{cos}$, which is then processed together with the input $X$ by pixel attention (PA) to obtain the final weights $W$. The CGA module is defined by Equations (2)–(4), and its structure is shown in Figure 6b.
$W_s = C_{7 \times 7}\left(\left[X_{GAP}^{s}, X_{GMP}^{s}\right]\right)$  (2)
$W_c = C_{1 \times 1}\left(\max\left(0, C_{1 \times 1}\left(X_{GAP}^{c}\right)\right)\right)$  (3)
$W = \sigma\left(C_{7 \times 7}\left(CS\left(\left[X, W_{cos}\right]\right)\right)\right)$  (4)
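A rough sketch of this fusion computation is given below. The kernel sizes follow Equations (1)–(4), but feeding $F_1 + F_2$ into the CGA branch, the channel-reduction ratio, and the grouped 7 × 7 convolution standing in for the channel-shuffle step are assumptions based on the DEA-Net design rather than details stated in this paper.

```python
import torch
import torch.nn as nn

class CGASketch(nn.Module):
    """Content-guided attention: spatial weights W_s, channel weights W_c, their
    coarse combination W_cos, then a pixel-level refinement to get the final W."""
    def __init__(self, c, reduction=8):
        super().__init__()
        self.sa = nn.Conv2d(2, 1, 7, padding=3)                        # Eq. (2)
        self.ca = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(c, c // reduction, 1), nn.ReLU(),
                                nn.Conv2d(c // reduction, c, 1))       # Eq. (3)
        self.pa = nn.Conv2d(2 * c, c, 7, padding=3, groups=c)          # Eq. (4)

    def forward(self, x):
        w_s = self.sa(torch.cat([x.mean(1, keepdim=True),
                                 x.amax(1, keepdim=True)], dim=1))     # spatial attention
        w_c = self.ca(x)                                               # channel attention
        w_cos = w_s + w_c                                              # coarse weights, B x C x H x W
        # Interleave x and w_cos channel-wise (a simple stand-in for channel shuffle).
        mixed = torch.stack([x, w_cos], dim=2).flatten(1, 2)           # B x 2C x H x W
        return torch.sigmoid(self.pa(mixed))                           # pixel attention

class CGAFusionSketch(nn.Module):
    """Eq. (1): weighted sum of the two inputs plus their skip connections, then 1x1 conv."""
    def __init__(self, c):
        super().__init__()
        self.cga = CGASketch(c)
        self.proj = nn.Conv2d(c, c, 1)

    def forward(self, f1, f2):
        w = self.cga(f1 + f2)                                          # assumed CGA input
        return self.proj(f1 * w + f2 * (1 - w) + f1 + f2)

f1, f2 = torch.randn(1, 64, 40, 40), torch.randn(1, 64, 40, 40)
print(CGAFusionSketch(64)(f1, f2).shape)    # torch.Size([1, 64, 40, 40])
```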

2.2.3. Detection Head

The detection head in YOLOv8 has been updated to a decoupled head compared to previous generations of YOLO. It separates the classification task from the regression task, meaning the two no longer share parameters, allowing them to operate independently and better fulfill their respective tasks. However, the challenges faced in dense pedestrian detection, such as small objects, occlusion, and varying object scales, prompted us to introduce DyHead and make targeted improvements to design our DEHead.
DyHead [39], introduced by Microsoft in 2021, is composed of stacked DyHead blocks. Each DyHead block includes spatial attention, scale attention, and task attention mechanisms. These mechanisms perform scale-aware, spatial-aware, and task-aware operations at different levels, while integrating features across different levels to improve sensitivity to features related to small objects. The spatial awareness operation $\pi_S$ is carried out using DCNv2, which adjusts the receptive field to achieve spatial perception. Figure 7c,d show how DCNv2 enlarges the receptive field compared to standard convolution. In dense pedestrian detection, there is only one class, so we removed the task-aware operation $\pi_L$ from DyHead. Additionally, we replaced the scale attention mechanism $\pi_C$ with the EMA attention mechanism (Efficient Multi-Scale attention) $\pi_E$ to enhance the detection head’s ability to detect small objects. The formulas for DyHead and DEHead are given in Equations (5) and (6), and the structures of the DyHead Block and DEHead Block are shown in Figure 7a,b.
$W(F) = \pi_C\left(\pi_S\left(\pi_L(F) \cdot F\right) \cdot F\right) \cdot F$  (5)
$W(F) = \pi_E\left(\pi_S(F) \cdot F\right) \cdot F$  (6)
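Written as code, the two compositions differ only in which attentions are applied; the attention operators below are placeholders rather than implementations.

```python
import torch

# pi_L (task), pi_S (spatial, DCNv2-based), pi_C (scale) and pi_E (EMA) are
# treated here as interchangeable callables that return per-element weights.
def dyhead_block(F, pi_L, pi_S, pi_C):
    # Eq. (5): W(F) = pi_C(pi_S(pi_L(F) * F) * F) * F
    return pi_C(pi_S(pi_L(F) * F) * F) * F

def dehead_block(F, pi_S, pi_E):
    # Eq. (6): task attention removed, scale attention replaced by EMA.
    return pi_E(pi_S(F) * F) * F

feat = torch.randn(1, 256, 20, 20)
identity = lambda t: torch.ones_like(t)              # stand-in attention weights
print(dehead_block(feat, identity, identity).shape)  # torch.Size([1, 256, 20, 20])
```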
The EMA [40] builds upon its predecessor, the Coordinate Attention (CoorA) [41]. Generally, global average pooling can be used to extract channel information, as in the SE attention mechanism. The CoorA further integrates spatial information across dimensions into the extracted channel information, making it particularly suitable for enhancing feature fusion. The CoorA performs one-dimensional global pooling on the input feature $X \in \mathbb{R}^{H \times W \times C}$ along both the height ($H$) and width ($W$) directions to obtain $X_1 \in \mathbb{R}^{1 \times W \times C}$ and $X_2 \in \mathbb{R}^{H \times 1 \times C}$. The resulting $X_1$ and $X_2$ are then concatenated and passed through convolution operations for information extraction, followed by batch normalization (BatchNorm), an activation function (Sigmoid), and other processing steps. This process effectively preserves information that can focus on pixel-level targets. Building upon this, EMA introduces an additional branch that interacts with the features modeled by the CoorA, allowing effective information exchange between shallow and deep layers. This global modeling provides excellent pixel-level attention for images. The structures of the CoorA and the EMA are illustrated in Figure 7e,f.
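The coordinate-attention computation that EMA builds on can be sketched as follows (the reduction ratio and activation are illustrative, and EMA's additional cross-spatial 3 × 3 branch is omitted for brevity).

```python
import torch
import torch.nn as nn

class CoordAttSketch(nn.Module):
    """Coordinate attention: 1-D global pooling along H and along W, joint encoding,
    then separate sigmoid gates re-weight the input along each axis."""
    def __init__(self, c, reduction=16):
        super().__init__()
        mid = max(8, c // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))    # B x C x H x 1
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))    # B x C x 1 x W
        self.encode = nn.Sequential(nn.Conv2d(c, mid, 1),
                                    nn.BatchNorm2d(mid), nn.SiLU())
        self.attn_h = nn.Conv2d(mid, c, 1)
        self.attn_w = nn.Conv2d(mid, c, 1)

    def forward(self, x):
        B, C, H, W = x.shape
        xh = self.pool_h(x)                              # B x C x H x 1
        xw = self.pool_w(x).permute(0, 1, 3, 2)          # B x C x W x 1
        y = self.encode(torch.cat([xh, xw], dim=2))      # joint 1x1 conv + BN + act
        yh, yw = torch.split(y, [H, W], dim=2)
        a_h = torch.sigmoid(self.attn_h(yh))                      # B x C x H x 1
        a_w = torch.sigmoid(self.attn_w(yw.permute(0, 1, 3, 2)))  # B x C x 1 x W
        return x * a_h * a_w                             # pixel-level re-weighting

print(CoordAttSketch(64)(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```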

3. Results

3.1. Experimental Environments and Dataset

The configuration of the experimental environment is as follows. The operating system is Ubuntu 20.04, the GPU is an RTX 3090 (24 GB), the CPU is an Intel Xeon Platinum 8362, the Python version is 3.8, and the deep learning framework is PyTorch (torch 1.13.1 + cu117). The training image size is uniformly set to 640 × 640, the number of training epochs is 100, the optimizer is SGD, and the training batch size is set to the maximum supported by the GPU, with all other hyperparameters left at their defaults. The experimental environment configuration is shown in Table 1 below.
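For reference, the baseline training run described above corresponds roughly to the following Ultralytics invocation; the dataset YAML name is a placeholder, and batch=-1 asks the framework to choose the largest batch size that fits in GPU memory.

```python
from ultralytics import YOLO

# Baseline YOLOv8n run under the settings reported above (illustrative only).
model = YOLO("yolov8n.yaml")           # build from scratch, no pretrained weights
model.train(
    data="crowdhuman.yaml",            # placeholder dataset config
    imgsz=640,
    epochs=100,
    optimizer="SGD",
    batch=-1,                          # auto-select the maximum batch size
)
```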
To verify the generalization ability and accuracy of the model, the datasets we selected must contain dense detection targets, significant occlusion, and variable scenes. In this way, we can effectively validate the proposed model.
There are many existing pedestrian detection datasets, such as Caltech-USA, KITTI, Cityperson [42], and COCOPerson, but these datasets suffer from issues like having few targets, single scenes, and limited scales. We validate our model using two different public datasets, Crowdhuman [43] and Widerperson [44]. Comparing across different datasets, it can effectively verify the model’s generalization ability and accuracy.
First, the Crowdhuman dataset contains 15,000 training images, 5000 test images, and 4370 validation images. The training and validation sets include at least 470k instances, with approximately 23 targets per photo. It encompasses a wide variety of scenes, such as roads, parks, indoor environments, and construction sites, with a predominance of small and occluded targets, making it very suitable for our experimental validation.
Next, Widerperson is another commonly used dataset for dense pedestrian detection, containing 8000 training images and 1000 validation images, with a total of 240k targets. The average number of targets per image reaches 29.51, making it the dataset with the highest density among currently available public datasets. It also has rich scene variations and target scale diversity, making it very suitable for this experimental validation. Table 2 provides a comparison of the different datasets.

3.2. Evaluation Indexes

The evaluation metrics selected in this paper are model parameters (Param), mean average precision (mAP), and recall (R). The model parameters (Param) serve as an indicator of whether a model is lightweight, and the recall (R) effectively reflects the rate of missed detections in the model. The calculation of recall is shown in Equation (8).
$P = \dfrac{TP}{TP + FP}$  (7)
$R = \dfrac{TP}{TP + FN}$  (8)
In these equations, TP (True Positives) is the number of correct detections, FP (False Positives) the number of incorrect detections, and FN (False Negatives) the number of missed detections. These three quantities form the basis for calculating precision and recall.
Mean average precision (mAP) is a metric that effectively reflects the accuracy of the model. It is computed from precision (P) and recall (R), and its calculation is shown in Equation (10). By setting different IoU thresholds (such as IoU at 0.5, or IoU from 0.5 to 0.95), we obtain the values of mAP@0.5 and mAP@0.5:0.95. The specific calculation method is as follows.
$AP_i = \int_{0}^{1} P_i(R_i)\, dR_i = \sum_{k=0}^{n} P_i(k)\, \Delta R_i(k)$  (9)
$mAP = \dfrac{1}{m} \sum_{i=1}^{m} AP_i$  (10)
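As a small worked example, the following sketch computes precision, recall, and AP from score-ranked detections exactly as in Equations (7)–(10); the sample detections and ground-truth count are made up for illustration.

```python
import numpy as np

def average_precision(tp_flags, num_gt):
    """AP = sum_k P(k) * delta_R(k) over detections ranked by confidence.
    `tp_flags` marks each ranked detection as a true positive (1) or false positive (0)."""
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1 - np.asarray(tp_flags))
    precision = tp / (tp + fp)                 # P = TP / (TP + FP)
    recall = tp / num_gt                       # R = TP / (TP + FN), FN = num_gt - TP
    delta_r = np.diff(np.concatenate(([0.0], recall)))
    return float(np.sum(precision * delta_r))

# Five detections sorted by confidence, matched against 4 ground-truth pedestrians.
ap_person = average_precision([1, 1, 0, 1, 0], num_gt=4)
print(round(ap_person, 3))          # 0.688 (= 0.6875 rounded)

# mAP is the mean AP over classes; with the single "person" class, mAP equals AP.
print(round(np.mean([ap_person]), 3))
```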

3.3. Experimental Results and Analysis

3.3.1. Ablation Experiments

We conducted ablation experiments to analyze the effectiveness of our improved modules by adding different improvement modules to the YOLOv8n model (Baseline) and evaluating their performance. We refer to the baseline as YOLOv8n. The experimental result of replacing the backbone network is denoted as YOLOv8n_MViT. The result of adding the small target layer is denoted as YOLOv8n_MViT_P2. Replacing the neck network with our SCNeck is denoted as YOLOv8n_MViT_SCNeck, and replacing the detection head with DyHead and our DEHead is denoted as YOLOv8n_MViT_DyHead and YOLOv8n_MViT_DEHead. Finally, combining all the improved modules results in our final model, named MSCD-YOLO. All models were trained with the default parameters for 100 epochs. The experimental results on the Crowdhuman dataset are shown in Table 3 and Figure 8, and the results on the Widerperson dataset are shown in Table 4 and Figure 9.
From the results, we can see that although replacing the backbone network leads to a slight loss in accuracy, the model with the replaced backbone has only about one-third of the parameters of the baseline YOLOv8n. Due to MViT’s efficient local and global feature extraction capabilities, combined with the addition of a small object layer, the model surpasses the baseline with minimal additional cost. This gives us a lot of flexibility for subsequent improvements. After replacing the backbone with MViT and adding SCNeck, the finer-grained feature interaction in the neck network effectively integrates low-level positional information with high-level semantic information, enabling the model to better judge edge information, and we obtained consistent accuracy gains. On the Crowdhuman dataset, recall (R), mAP50, and mAP50-95 increased by 2.9%, 3.5%, and 3.5%, respectively; on the Widerperson dataset, recall (R), mAP50, and mAP50-95 increased by 1.6%, 1.5%, and 2.0%, respectively, while bringing only a small increase in parameters. When the detection head was replaced with DyHead (on top of the MViT backbone), DyHead’s multi-scale interaction mechanism better captured multi-scale information, enabling the model to detect and recognize objects of different sizes and shapes. On the Crowdhuman dataset, recall (R), mAP50, and mAP50-95 increased by 2.3%, 2.7%, and 2.7%, respectively; on the Widerperson dataset, recall (R), mAP50, and mAP50-95 increased by 0.7%, 1.2%, and 1.5%. Next, replacing DyHead with DEHead retained DyHead’s ability to interact with multi-scale information while introducing long-range semantic modeling, resulting in deeper semantic information and a more focused perception capability for the model. On the Crowdhuman dataset, recall (R), mAP50, and mAP50-95 increased by 3.2%, 3.4%, and 3.5%, respectively; on the Widerperson dataset, recall (R), mAP50, and mAP50-95 increased by 1.6%, 1.5%, and 2.1%, showing improvements over DyHead while keeping the model’s complexity at less than half that of the baseline YOLOv8n. Finally, combining all the improvements into MSCD-YOLO, while the model complexity is only two-thirds that of YOLOv8n, recall (R) on the Crowdhuman dataset improved by 4.5%, mAP50 by 4.6%, and mAP50-95 by 5.3%; on the Widerperson dataset, recall (R) improved by 1.9%, mAP50 by 1.8%, and mAP50-95 by 2.6%.

3.3.2. Comparison and Analysis of Different Model

We compared our model against one-stage deep learning models such as SSD and the YOLO series, as well as two-stage models such as the RCNN series. All models were trained for 100 epochs with consistent training parameters and without using pretrained weights. The final results are shown in Figure 10 and Figure 11 and in Table 5 and Table 6.
From the experimental results, we can see that our model achieves an excellent balance between complexity and accuracy. The model with the lowest complexity is YOLOv5n, with only 1.76M parameters; our model adds only 0.27M parameters on top of that, placing it firmly in the top tier alongside YOLOv5n in terms of model complexity. Regarding recall, our model ranks just behind YOLOv8s, but with only about one-fifth of YOLOv8s’s parameter count the difference is merely 0.4%, demonstrating the model’s strong performance. In terms of mAP@0.5 and mAP@0.5:0.95 accuracy, our model holds a clear lead. Whether compared with complex two-stage models or with other one-stage models, our model achieves the best results with minimal parameters.

3.3.3. Visual Analysis of Results

We conducted feature heatmap visualization to compare the MSCD-YOLO model with the baseline YOLOv8n model to better demonstrate the performance of our model. Feature heatmap visualizations and model inference were performed in various scenes, such as crowded streets during the day, train stations, poorly lit roads at night, and indoor shopping malls. These diverse scenarios, shown in Figure 12, Figure 13, Figure 14 and Figure 15, effectively validate the generalization capability and outstanding performance of our model. Figure 12a represents the image with original labels, Figure 12b shows the feature extraction heatmap and model inference results of the YOLOv8n (Baseline) model, and Figure 12c–e present the feature extraction heatmap and model inference results of the different improvements to the baseline model. Figure 13, Figure 14 and Figure 15 follow the same structure.
From the feature heatmap visualization results (heatmap distribution) and model inference, it can be observed that, compared to the heatmaps of the YOLOv8n model, the integration of SCNeck enables higher-quality information fusion, helping the model focus more on edges. Additionally, the long-range modeling capability of DEHead enriches the semantic information of features, allowing the model to concentrate its attention more effectively. The combination of these two components in the MSCD-YOLO model enables it to focus on the entire person, including actions, movements, and full-body detection, resembling the way the human brain perceives pedestrians. This characteristic highlights the holistic correlation of pedestrians. Compared to the YOLOv8n model, the MSCD-YOLO model demonstrates more precise attention to pedestrians, enhancing its detection capabilities and effectively distinguishing between pedestrian and non-pedestrian regions. Our heatmaps more accurately cover pedestrians without being influenced by other objects. For example, in Figure 12b, due to crowd occlusion, the YOLOv8n model misses detections on the right side of the image. In Figure 13a, although the labels do not include parts of pedestrians obscured by glass, YOLOv8n mistakenly detects two targets and has multiple missed detections on the right side of the image. In contrast, MSCD-YOLO correctly detects these targets and accurately identifies different targets in dense crowd areas. Figure 14 and Figure 15 represent low-light nighttime scenes and indoor environments, respectively. In these scenarios, the YOLOv8n model performs poorly. For instance, in Figure 14b, there are many false detections in the darker areas on the right, while in Figure 15b, there are several missed detections in the right and central regions. The miss rate of YOLOv8n is significantly higher in low-light environments than in well-lit daytime scenes. In contrast, the MSCD-YOLO model performs much better. It can even detect targets not included in the labels, such as the person on the electric scooter on the left and the distant small pedestrian targets in Figure 14e, as well as partially occluded targets detected in the shadowed area on the right.

4. Discussion

Compared to the original YOLOv8n model, the proposed MSCD-YOLO model achieves higher detection accuracy for dense pedestrian detection. However, some issues remain, such as difficulty in detecting heavily concealed targets and missed detections under severe occlusion. For instance, when only a part of the body is exposed, detection performance is improved, but missed detections still occur. These challenges remain significant obstacles to our model’s detection capability. In the future, we hope to continue optimizing the model, for example by incorporating new attention mechanisms. Additionally, although our model is smaller, the specially designed SCNeck introduces additional multi-scale feature fusion paths, which inevitably increases inference time. We can reduce inference time while preserving detection performance through pruning, distillation, reparameterization, or more efficient neck fusion layers. At the same time, since the annotations in the Widerperson dataset are not as strong as those in the Crowdhuman dataset, and Widerperson has greater scene variation with a smaller data scale, the model’s improvement on Widerperson is less noticeable than on Crowdhuman. To achieve higher accuracy, we could add extra supervision heads or design new attention mechanisms that help the model focus more on key areas.

5. Conclusions

This paper introduces a model called MSCD-YOLO, specifically designed for dense pedestrian detection tasks. It is capable of handling scenarios with dense crowds, occlusion, varying target scales, complex scenes, and a large number of targets. First, to address model complexity, we incorporate MobileViT, which combines local features extracted by CNN with global features obtained by Transformers, thereby increasing the amount of information extracted by the backbone network. Then, to tackle the common challenges in dense pedestrian detection tasks, we propose our SCNeck. This introduces stride-free convolution (SPD) and its reverse operation, ReSPD, creating new auxiliary branches and fusing them with the original neck using CGAFusion. This approach solves the issue of information loss during feature fusion. In addition, we improve the DyHead module to better adapt it for dense pedestrian detection tasks, enabling it to detect small targets more effectively. To evaluate our model, we conducted experiments on two different datasets, Crowdhuman and Widerperson. This approach allows us to not only test the model’s performance but also compare its generalization capabilities. The experimental results demonstrate that our model holds a clear advantage over the current mainstream models.

Author Contributions

Conceptualization, Q.L. and Z.L.; methodology, Q.L. and Z.L.; software, Q.L.; validation, Q.L., L.Z. and J.D.; formal analysis, Q.L. and Z.L.; investigation, Q.L. and J.D.; resources, Q.L. and L.Z.; data curation, Q.L. and Z.L.; writing—original draft preparation, Q.L.; writing—review and editing, Q.L. and Z.L.; visualization, Q.L.; funding acquisition, Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant Nos. 61263040 and 62262043), and the Natural Science Foundation of Jiangxi Province of China (Grant No. 20202BABL202005).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

You can access the Crowdhuman and Widerperson datasets through the following links. Crowd Human dataset: https://www.crowdhuman.org/ (accessed on 22 July 2024). Wider Person dataset: http://www.cbsr.ia.ac.cn/users/sfzhang/WiderPerson/ (accessed on 22 July 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. Their Appl. 1998, 13, 18–28. [Google Scholar] [CrossRef]
  2. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; pp. 886–893. [Google Scholar]
  3. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  4. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  5. Girshick, R. Fast r-cnn. arXiv 2015, arXiv:1504.08083. [Google Scholar] [CrossRef]
  6. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  7. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  8. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. 2016. pp. 21–37. [Google Scholar]
  9. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  10. Redmon, J. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  11. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  12. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. Yolov9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  13. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  14. Vaswani, A. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  15. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  16. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  17. Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  18. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  19. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  20. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919. [Google Scholar]
  21. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 568–578. [Google Scholar]
  22. Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Occlusion-aware R-CNN: Detecting pedestrians in a crowd. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 637–653. [Google Scholar]
  23. Xu, C.; Wang, G.; Yan, S.; Yu, J.; Zhang, B.; Dai, S.; Li, Y.; Xu, L. Fast vehicle and pedestrian detection using improved Mask R-CNN. Math. Probl. Eng. 2020, 2020, 5761414. [Google Scholar] [CrossRef]
  24. Zheng, Y.; Izzat, I.H.; Ziaee, S. GFD-SSD: Gated fusion double SSD for multispectral pedestrian detection. arXiv 2019, arXiv:1903.06999. [Google Scholar]
  25. Zhao, L.; Li, S. Object detection algorithm based on improved YOLOv3. Electronics 2020, 9, 537. [Google Scholar] [CrossRef]
  26. Liu, H.; Sun, F.; Gu, J.; Deng, L. Sf-yolov5: A lightweight small object detection algorithm based on improved feature fusion mode. Sensors 2022, 22, 5817. [Google Scholar] [CrossRef]
  27. Li, S.; Wang, S.; Wang, P. A small object detection algorithm for traffic signs based on improved YOLOv7. Sensors 2023, 23, 7145. [Google Scholar] [CrossRef]
  28. Li, Y.; Fan, Q.; Huang, H.; Han, Z.; Gu, Q. A modified YOLOv8 detection network for UAV aerial image recognition. Drones 2023, 7, 304. [Google Scholar] [CrossRef]
  29. Lou, H.; Duan, X.; Guo, J.; Liu, H.; Gu, J.; Bi, L.; Chen, H. DC-YOLOv8: Small-size object detection algorithm based on camera sensor. Electronics 2023, 12, 2323. [Google Scholar] [CrossRef]
  30. Wang, B.; Li, Y.Y.; Xu, W.; Wang, H.; Hu, L. Vehicle–Pedestrian Detection Method Based on Improved YOLOv8. Electronics 2024, 13, 2149. [Google Scholar] [CrossRef]
  31. Zhang, R.; Xu, L.; Yu, Z.; Shi, Y.; Mu, C.; Xu, M. Deep-IRTarget: An automatic target detector in infrared imagery using dual-domain feature extraction and allocation. IEEE Trans. Multimed. 2021, 24, 1735–1749. [Google Scholar] [CrossRef]
  32. Zhang, R.; Liu, G.; Zhang, Q.; Lu, X.; Dian, R.; Yang, Y.; Xu, L. Detail-Aware Network for Infrared Image Enhancement. IEEE Trans. Geosci. Remote Sens. 2024, 63, 5000314. [Google Scholar] [CrossRef]
  33. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  34. Li, H.; Xiong, P.; An, J.; Wang, L. Pyramid attention network for semantic segmentation. arXiv 2018, arXiv:1805.10180. [Google Scholar] [CrossRef]
  35. Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
  36. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  37. Sunkara, R.; Luo, T. No more strided convolutions or pooling: A new CNN building block for low-resolution images and small objects. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Grenoble, France, 19–23 September 2022; pp. 443–459. [Google Scholar]
  38. Chen, Z.; He, Z.; Lu, Z.M. DEA-Net: Single image dehazing based on detail-enhanced convolution and content-guided attention. IEEE Trans. Image Process. 2024, 33, 1002–1015. [Google Scholar] [CrossRef] [PubMed]
  39. Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7373–7382. [Google Scholar]
  40. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  41. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  42. Zhang, S.; Benenson, R.; Schiele, B. Citypersons: A diverse dataset for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3213–3221. [Google Scholar]
  43. Shao, S.; Zhao, Z.; Li, B.; Xiao, T.; Yu, G.; Zhang, X.; Sun, J. Crowdhuman: A benchmark for detecting human in a crowd. arXiv 2018, arXiv:1805.00123. [Google Scholar] [CrossRef]
  44. Zhang, S.; Xie, Y.; Wan, J.; Xia, H.; Li, S.Z.; Guo, G. Widerperson: A diverse dataset for dense pedestrian detection in the wild. IEEE Trans. Multimed. 2019, 22, 380–393. [Google Scholar] [CrossRef]
Figure 1. The overall network architecture of YOLOv8.
Figure 2. The overall network architecture of MSCD-YOLO.
Figure 3. The architecture of MV2.
Figure 4. The architecture of the MViT Block module.
Figure 5. The architecture of SPD-Conv; (a) SPD-Conv; (b) ReSPD-Conv.
Figure 6. The CGA feature fusion architecture; (a) CGAFusion; (b) CGA module.
Figure 7. The improvement of the Head; (a) DyHead; (b) DEHead; (c) Conv; (d) Deformable Conv; (e) coordinate attention; (f) Efficient Multi-Scale attention.
Figure 8. Results of the ablation experiment on the Crowdhuman dataset.
Figure 9. Results of the ablation experiment on the Widerperson dataset.
Figure 10. Comparison of different models in terms of mAP@0.5, mAP@0.5-0.95, Recall, and Param on the Crowdhuman dataset.
Figure 11. Comparison of different models in terms of mAP@0.5, mAP@0.5-0.95, Recall, and Param on the Widerperson dataset.
Figure 12. Comparison of feature extraction between different models (street).
Figure 13. Comparison of feature extraction between different models (train).
Figure 14. Comparison of feature extraction between different models (night).
Figure 15. Comparison of feature extraction between different models (mall).
Table 1. Experimental platform.
Environment | Configuration
Operating System | Ubuntu 20.04
GPU | RTX 3090 (24 GB)
CPU | Intel Xeon Platinum 8362
Python | 3.8.19
Deep Learning Framework | torch 1.13.1 + cu117
Optimizer | SGD
Table 2. Dense pedestrian detection datasets.
Dataset | Number of Images | Target Instances | Instances per Image
Caltech-USA | 42,782 | 13,674 | 0.32
KITTI | 3712 | 2322 | 0.63
COCOPerson | 64,115 | 257,252 | 4.01
Cityperson | 2975 | 19,238 | 6.47
Crowdhuman | 15,000 | 470,000+ | 22.63
Widerperson | 8000 | 240,000 | 26.51
Table 3. Ablation experiment on the Crowdhuman dataset.
Method | Param (M) | R (%) | mAP50 (%) | mAP50-95 (%)
YOLOv8n (Baseline) | 3.0 | 65.9 | 75.8 | 48.0
YOLOv8n_MViT | 1.18 | 64.5 | 74.7 | 46.4
YOLOv8n_MViT_P2 | 1.28 | 66.7 | 77.1 | 48.7
YOLOv8n_MViT_SCNeck | 1.97 | 68.8 | 79.3 | 51.5
YOLOv8n_MViT_DyHead | 1.37 | 68.2 | 78.5 | 50.7
YOLOv8n_MViT_DEHead | 1.35 | 69.1 | 79.2 | 51.5
MSCD-YOLO | 2.03 (−0.97) | 70.4 (+4.5) | 80.4 (+4.6) | 53.3 (+5.3)
Table 4. Ablation experiment on the Widerperson dataset.
Method | Param (M) | R (%) | mAP50 (%) | mAP50-95 (%)
YOLOv8n (Baseline) | 3.0 | 79.6 | 88.3 | 62.3
YOLOv8n_MViT | 1.18 | 78.4 | 88.0 | 61.6
YOLOv8n_MViT_P2 | 1.28 | 80.4 | 89.1 | 62.9
YOLOv8n_MViT_SCNeck | 1.97 | 81.2 | 89.8 | 64.3
YOLOv8n_MViT_DyHead | 1.37 | 80.3 | 89.5 | 63.8
YOLOv8n_MViT_DEHead | 1.35 | 81.2 | 89.8 | 64.4
MSCD-YOLO | 2.03 (−0.97) | 81.5 (+1.9) | 90.1 (+1.8) | 64.9 (+2.6)
Table 5. Comparison of different models on the Crowdhuman dataset.
Method | Param (M) | R (%) | mAP50 (%) | mAP50-95 (%)
Faster-RCNN | 41.34 | – | 78.0 | 49.9
Mask-RCNN | 43.99 | – | 77.0 | 47.0
SSD | 23.746 | – | 69.6 | 34.7
YOLOv5n | 1.76 | 60.8 | 70.8 | 40.4
YOLOv5s | 7.02 | 67.0 | 77.2 | 47.2
YOLOv7-Tiny | 6.01 | 69.41 | 78.56 | 46.22
YOLOv8n (Baseline) | 3.0 | 65.9 | 75.8 | 48.0
YOLOv8s | 11.1 | 70.8 | 80.1 | 53.2
YOLOv9-Tiny | 2.61 | 65.2 | 75.3 | 47.9
YOLOv10n | 2.7 | 64.3 | 74.8 | 46.9
YOLOv10s | 8.03 | 69.6 | 79.5 | 52.3
YOLOv11n | 2.59 | 64.9 | 75.4 | 47.4
YOLOv11s | 9.41 | 70.4 | 79.8 | 52.8
MSCD-YOLO (Ours) | 2.03 | 70.6 | 80.4 | 53.3
Table 6. Comparison of different models on the Widerperson dataset.
Method | Param (M) | R (%) | mAP50 (%) | mAP50-95 (%)
Faster-RCNN | 41.34 | – | 86.9 | 58.9
Mask-RCNN | 43.99 | – | 86.9 | 59.0
SSD | 23.746 | – | 77.9 | 43.7
YOLOv5n | 1.76 | 76.3 | 86.8 | 57.7
YOLOv5s | 7.02 | 77.3 | 88.2 | 60.5
YOLOv7-Tiny | 6.01 | 80.5 | 89.0 | 59.5
YOLOv8n (Baseline) | 3.0 | 79.6 | 88.3 | 62.3
YOLOv8s | 11.1 | 82.2 | 90.1 | 64.6
YOLOv9-Tiny | 2.61 | 81.9 | 88.6 | 62.8
YOLOv10n | 2.7 | 78.5 | 87.6 | 61.6
YOLOv10s | 8.03 | 80.4 | 89.7 | 64.0
YOLOv11n | 2.59 | 79.3 | 88.2 | 62.1
YOLOv11s | 9.41 | 81.3 | 89.9 | 64.4
MSCD-YOLO (Ours) | 2.03 | 81.5 | 90.1 | 64.9
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
