Search Results (1,131)

Search Parameters:
Keywords = bounding box

16 pages, 20081 KiB  
Article
YOLO-ACE: Enhancing YOLO with Augmented Contextual Efficiency for Precision Cotton Weed Detection
by Qi Zhou, Huicheng Li, Zhiling Cai, Yiwen Zhong, Fenglin Zhong, Xiaoyu Lin and Lijin Wang
Sensors 2025, 25(5), 1635; https://doi.org/10.3390/s25051635 - 6 Mar 2025
Abstract
Effective weed management is essential for protecting crop yields in cotton production, yet conventional deep learning approaches often falter in detecting small or occluded weeds and can be restricted by large parameter counts. To tackle these challenges, we propose YOLO-ACE, an advanced extension of YOLOv5s, which was selected for its optimal balance of accuracy and speed, making it well suited for agricultural applications. YOLO-ACE integrates a Context Augmentation Module (CAM) and Selective Kernel Attention (SKAttention) to capture multi-scale features and dynamically adjust the receptive field, while a decoupled detection head separates classification from bounding box regression, enhancing overall efficiency. Experiments on the CottonWeedDet12 (CWD12) dataset show that YOLO-ACE achieves notable mAP@0.5 and mAP@0.5:0.95 scores—95.3% and 89.5%, respectively—surpassing previous benchmarks. Additionally, we tested the model’s transferability and generalization across different crops and environments using the CropWeed dataset, where it achieved a competitive mAP@0.5 of 84.3%, further showcasing its robust ability to adapt to diverse conditions. These results confirm that YOLO-ACE combines precise detection with parameter efficiency, meeting the exacting demands of modern cotton weed management. Full article
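Since the abstract leans on the decoupled-head idea, a minimal PyTorch sketch of what "separating classification from bounding box regression" typically looks like may help. It is not the authors' implementation; the channel count, activation, anchor count, and the class count (12, matching CWD12) are assumptions.

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Minimal decoupled detection head: separate conv branches for
    classification and bounding-box regression (illustrative sketch only)."""
    def __init__(self, in_channels: int, num_classes: int, num_anchors: int = 3):
        super().__init__()
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(in_channels, num_anchors * num_classes, 1),
        )
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(in_channels, num_anchors * 4, 1),  # (x, y, w, h) per anchor
        )

    def forward(self, feat: torch.Tensor):
        return self.cls_branch(feat), self.reg_branch(feat)

# Example: one 80x80 feature map from a YOLOv5s-like neck (channel count assumed).
head = DecoupledHead(in_channels=128, num_classes=12)
cls_out, reg_out = head(torch.randn(1, 128, 80, 80))
print(cls_out.shape, reg_out.shape)  # torch.Size([1, 36, 80, 80]) torch.Size([1, 12, 80, 80])
```

Each branch specializes: one predicts per-class scores, the other four box offsets per anchor, which is the separation the abstract credits with the efficiency gain.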
Figures:
Figure 1. (a) Diagram of the algorithm structure of YOLOv5s; (b) diagram of the algorithm of YOLO-ACE; (c) components in the algorithmic structure.
Figure 2. YOLO-ACE module integration flowchart.
Figure 3. Network architecture diagram of Context Augmentation Module.
Figure 4. Fusion methods of CAM: (a) and (b) show direct feature map integration via weighting and concatenation, respectively, while (c) employs an adaptive fusion—combining convolution, splicing, and softmax—to merge information from three channels.
Figure 5. Selective kernel attention.
Figure 6. Examples of the 12 categories of weeds in CottonWeedDet12.
Figure 7. Convergence curves of YOLO-ACE: (a) mAP at IoU = 0.5; (b) mAP across IoU thresholds from 0.5 to 0.95.
Figure 8. Detection comparative analysis of cotton weed detection under challenging conditions: (a) weeds exhibiting diverse shapes, sizes, and dimensions; (b) YOLOv5s failing to detect small and occluded targets; (c) YOLO-ACE demonstrating enhanced detection of small and occluded targets.
Figure 9. Comparative heatmap visualizations of YOLOv5 variants with module integrations—none, decoupled head (DH), SKAttention (SK), Context Augmentation Module (CAM), and Full Integration (YOLO-ACE).
Figure 10. Robust weed detection under variable conditions: (a–d) present detection outcomes under diverse lighting conditions and viewing angles.
Figure 11. Analysis of YOLO-ACE detection failures: (a,b) reveal that severe occlusion or overlap can lead to missed weed detections due to inherent ambiguities. (c,d) show that enhanced feature extraction may misclassify subtle plant features as weeds, resulting in false positives.
16 pages, 2124 KiB  
Article
SmartDENM—A System for Enhancing Pedestrian Safety Through Machine Vision and V2X Communication
by Abdulagha Dadashev and Árpád Török
Electronics 2025, 14(5), 1026; https://doi.org/10.3390/electronics14051026 - 4 Mar 2025
Abstract
A pivotal moment in the leap toward autonomous vehicles in recent years has revealed the need to enhance vehicle-to-everything (V2X) communication systems so as to improve road safety. A key challenge is to integrate real-time pedestrian detection to permit the use of timely alerts in situations where vulnerable road users, especially pedestrians, might pose a risk. Seeing that, in this article, a YOLO-based object detection model was used to identify pedestrians and extract key data such as bounding box coordinates and confidence levels. These data were encoded afterward into decentralized environmental notification messages (DENM) using ASN.1 schemas to ensure compliance with V2X standards, allowing for real-time communication between vehicles and infrastructure. This research identified that the integration of pedestrian detection with V2X communication brought about a reliable system wherein the roadside unit (RSU) broadcasts DENM alerts to vehicles. These vehicles, upon receiving the messages, initiate appropriate responses such as slowing down or lane changing, with the testing demonstrating reliable message transmission and high pedestrian detection accuracy in simulated–controlled environments. To conclude, this work demonstrates a scalable framework for improving road safety by combining machine vision with V2X communication. Full article
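As a rough illustration of the detection-to-message flow described above, the sketch below runs an off-the-shelf YOLOv5 model and packs pedestrian bounding boxes and confidences into a DENM-like payload. The paper encodes ASN.1 DENM messages; JSON is used here purely as a stand-in, and the field names (stationId, causeCode) are hypothetical.

```python
import json
import torch

# Off-the-shelf YOLOv5 from the Ultralytics hub, as a stand-in for the paper's detector.
model = torch.hub.load("ultralytics/yolov5", "yolov5s")

def pedestrians_to_message(image_path: str, station_id: int = 1001) -> str:
    """Detect pedestrians and pack bounding boxes + confidences into a
    DENM-like JSON payload (illustration only; the paper uses ASN.1 DENM)."""
    results = model(image_path)
    detections = results.pandas().xyxy[0]  # columns: xmin, ymin, xmax, ymax, confidence, class, name
    persons = detections[detections["name"] == "person"]
    payload = {
        "stationId": station_id,            # hypothetical RSU identifier
        "causeCode": "pedestrianOnRoad",    # illustrative cause code
        "detections": [
            {
                "bbox": [float(row.xmin), float(row.ymin), float(row.xmax), float(row.ymax)],
                "confidence": float(row.confidence),
            }
            for row in persons.itertuples()
        ],
    }
    return json.dumps(payload)

print(pedestrians_to_message("crosswalk.jpg"))
```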
Figures:
Figure 1. Smart I2V system process flowchart for pedestrian detection and vehicle coordination. The system integrates V2X communication protocols, AI-based detection, and vehicle behavior coordination to enhance real-time traffic safety.
Figure 2. Detection workflow with YOLOv5: model loading, inference, and JSON-based result publishing for downstream processing.
Figure 3. Four selected tested images: the first two images depict scenes with two individuals present, while the latter two images represent scenarios without any individuals.
Figure 4. Scenario: Pedestrian Crossing: demonstrates the integration of real-time pedestrian detection and V2X communication, showcasing how alerts enable vehicles to adapt their behavior for enhanced safety.
23 pages, 10794 KiB  
Article
Hand–Eye Separation-Based First-Frame Positioning and Follower Tracking Method for Perforating Robotic Arm
by Handuo Zhang, Jun Guo, Chunyan Xu and Bin Zhang
Appl. Sci. 2025, 15(5), 2769; https://doi.org/10.3390/app15052769 - 4 Mar 2025
Abstract
In subway tunnel construction, current hand–eye integrated drilling robots use a camera mounted on the drilling arm for image acquisition. However, dust interference and long-distance operation cause a decline in image quality, affecting the stability and accuracy of the visual recognition system. Additionally, the computational complexity of high-precision detection models limits deployment on resource-constrained edge devices, such as industrial controllers. To address these challenges, this paper proposes a dual-arm tunnel drilling robot system with hand–eye separation, utilizing the first-frame localization and follower tracking method. The vision arm (“eye”) provides real-time position data to the drilling arm (“hand”), ensuring accurate and efficient operation. The study employs an RFBNet model for initial frame localization, replacing the original VGG16 backbone with ShuffleNet V2. This reduces model parameters by 30% (135.5 MB vs. 146.3 MB) through channel splitting and depthwise separable convolutions to reduce computational complexity. Additionally, the GIoU loss function is introduced to replace the traditional IoU, further optimizing bounding box regression through the calculation of the minimum enclosing box. This resolves the gradient vanishing problem in traditional IoU and improves average precision (AP) by 3.3% (from 0.91 to 0.94). For continuous tracking, a SiamRPN-based algorithm combined with Kalman filtering and PID control ensures robustness against occlusions and nonlinear disturbances, increasing the success rate by 1.6% (0.639 vs. 0.629). Experimental results show that this approach significantly improves tracking accuracy and operational stability, achieving 31 FPS inference speed on edge devices and providing a deployable solution for tunnel construction’s safety and efficiency needs. Full article
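The GIoU term mentioned above penalizes the empty area of the minimum enclosing box, which keeps gradients informative even when predicted and ground-truth boxes do not overlap. Below is a generic re-implementation sketch of the GIoU loss for axis-aligned boxes; the paper applies it inside RFBNet, whereas this standalone version is for illustration only.

```python
import torch

def giou_loss(box1: torch.Tensor, box2: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """GIoU loss for boxes in (x1, y1, x2, y2) format.
    GIoU = IoU - |C \\ (A U B)| / |C|, with C the smallest enclosing box; loss = 1 - GIoU.
    Illustrative re-implementation, not the paper's code."""
    # Intersection
    ix1 = torch.max(box1[..., 0], box2[..., 0])
    iy1 = torch.max(box1[..., 1], box2[..., 1])
    ix2 = torch.min(box1[..., 2], box2[..., 2])
    iy2 = torch.min(box1[..., 3], box2[..., 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)

    area1 = (box1[..., 2] - box1[..., 0]) * (box1[..., 3] - box1[..., 1])
    area2 = (box2[..., 2] - box2[..., 0]) * (box2[..., 3] - box2[..., 1])
    union = area1 + area2 - inter
    iou = inter / (union + eps)

    # Smallest enclosing box C
    cx1 = torch.min(box1[..., 0], box2[..., 0])
    cy1 = torch.min(box1[..., 1], box2[..., 1])
    cx2 = torch.max(box1[..., 2], box2[..., 2])
    cy2 = torch.max(box1[..., 3], box2[..., 3])
    c_area = (cx2 - cx1) * (cy2 - cy1)

    giou = iou - (c_area - union) / (c_area + eps)
    return 1.0 - giou

# Disjoint boxes still produce a useful gradient signal (plain IoU would saturate at 0).
pred = torch.tensor([[0.0, 0.0, 10.0, 10.0]])
gt   = torch.tensor([[20.0, 20.0, 30.0, 30.0]])
print(giou_loss(pred, gt))  # approx. 1.78, i.e. > 1 because GIoU is negative when boxes are disjoint
```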
Figures:
Figure 1. Hand–eye separation schematic.
Figure 2. Network structure of initial frame positioning model based on improved RFBNet.
Figure 3. ShuffleNet V2 block.
Figure 4. Comparison of tracking results of the proposed method with the baseline algorithm and the classical algorithm.
Figure 5. Tracking and positioning process diagram of drilling robot arm based on SiamRPN.
Figure 6. PID control system block diagram.
Figure 7. Comparison of initial frame positioning model effect of improved RFBNet.
Figure 8. Comparison of success rate and accuracy rate between proposed algorithm and baseline algorithm.
Figure 9. Comparison of success rate and accuracy rate between proposed algorithm and classical algorithm.
Figure 10. Comparison of accuracy rates between the proposed method and the classical algorithm.
Figure 11. Comparison of accuracy rates between the proposed method and the classical algorithm.
Figure 12. Comparison of tracking results of the proposed method with the baseline algorithm and the classical algorithm.
25 pages, 152810 KiB  
Article
QEDetr: DETR with Query Enhancement for Fine-Grained Object Detection
by Chenguang Dong, Shan Jiang, Haijiang Sun, Jiang Li, Zhenglei Yu, Jiasong Wang and Jiacheng Wang
Remote Sens. 2025, 17(5), 893; https://doi.org/10.3390/rs17050893 - 3 Mar 2025
Abstract
Fine-grained object detection aims to accurately localize the object bounding box while identifying the specific model of the object, which is more challenging than conventional remote sensing object detection. Transformer-based object detector (DETR) can capture remote inter-feature dependencies by using attention, which is suitable for fine-grained object detection tasks. However, most existing DETR-like object detectors are not specifically optimized for remote sensing detection tasks. Therefore, we propose an oriented fine-grained object detection method based on transformers. First, we combine denoising training and angle coding to propose a baseline DETR-like object detector for oriented object detection. Next, we propose a new attention mechanism for extracting finer-grained features by constraining the angle of sampling points during the attentional process, ensuring that the sampling points are more evenly distributed across the object features. Then, we propose a multiscale fusion method based on bilinear pooling to obtain the enhanced query and initialize a more accurate object bounding box. Finally, we combine the localization accuracy of each query with its classification accuracy and propose a new classification loss to further enhance the high-quality queries. Evaluation results on the FAIR1M dataset show that our method achieves an average accuracy of 48.5856 mAP and the highest accuracy of 49.7352 mAP in object detection, outperforming other methods. Full article
(This article belongs to the Section AI Remote Sensing)
Figures:
Figure 1. Overall structure of QEDetr: (a) extraction of input image features and inputting to encoder uses the same process as Deformable DETR; (b) multiscale fusion of feature maps and screening of top-k queries; (c) the RADA module used for the decoder process generates the categories layer-by-layer, iteratively refining the reference box and angle; (d) QEDetr uses the regression and IoU losses for the reference box and angle along with our proposed AIL for the categorization loss.
Figure 2. Decoder of QEDetr, showing the whole process of the decoder and its iterative layer-by-layer refinement of the reference box and angle coding.
Figure 3. Demonstration of the alignment process for each type of attention: (a) the alignment process for deformable attention, which does not include sample point rotations; (b) the alignment process for RDA, which includes sample point rotations but does not take into account the shape distribution of the sample points themselves; and (c) the alignment process for RADA, which we present here.
Figure 4. In multiscale bilinear fusion, a shared MLP is used to compute the foreground scores for each scale feature map, then the high-level feature map and foreground scores are upsampled and fused with the low-level information, and the output features and foregrounds from each layer are finally spliced into the outputs.
Figure 5. Number of instances per category in the FAIR1M2.0 dataset after multiscale cropping.
Figure 6. Loss curve maps of QEDetr.
Figure 7. Results on the FAIR1M dataset.
Figure 8. Comparison visualizing the results of different object detection algorithms on FAIR1M: (a) baseline (DN+PSC), (b) ARSDetr, and (c) QEDetr.
Figure 9. Heatmap visualization of the backbone.
Figure 10. Sample point visualization results: (a) deformable attention and (b) RADA. The colors of the sampled points represent their weights in the attention process.
19 pages, 7601 KiB  
Article
Mixture of Expert-Based SoftMax-Weighted Box Fusion for Robust Lesion Detection in Ultrasound Imaging
by Se-Yeol Rhyou, Minyung Yu and Jae-Chern Yoo
Diagnostics 2025, 15(5), 588; https://doi.org/10.3390/diagnostics15050588 - 28 Feb 2025
Abstract
Background/Objectives: Ultrasound (US) imaging plays a crucial role in the early detection and treatment of hepatocellular carcinoma (HCC). However, challenges such as speckle noise, low contrast, and diverse lesion morphology hinder its diagnostic accuracy. Methods: To address these issues, we propose CSM-FusionNet, a novel framework that integrates clustering, SoftMax-weighted Box Fusion (SM-WBF), and padding. Using raw US images from a leading hospital, Samsung Medical Center (SMC), we applied intensity adjustment, adaptive histogram equalization, low-pass, and high-pass filters to reduce noise and enhance resolution. Data augmentation generated ten images per one raw US image, allowing the training of 10 YOLOv8 networks. The mAP@0.5 of each network was used as SoftMax-derived weights in SM-WBF. Threshold-lowered bounding boxes were clustered using Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and outliers were managed within clusters. SM-WBF reduced redundant boxes, and padding enriched features, improving classification accuracy. Results: The accuracy improved from 82.48% to 97.58% with sensitivity reaching 100%. The framework increased lesion detection accuracy from 56.11% to 95.56% after clustering and SM-WBF. Conclusions: CSM-FusionNet demonstrates the potential to significantly improve diagnostic reliability in US-based lesion detection, aiding precise clinical decision-making. Full article
(This article belongs to the Special Issue Advances in Medical Image Processing, Segmentation and Classification)
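A compact sketch of the clustering-plus-fusion step described above is given below: per-network mAP@0.5 scores are turned into SoftMax weights, box centres are grouped with DBSCAN, and each cluster is fused by a weighted average of coordinates. The eps value, min_samples, and the exact fusion rule are assumptions rather than the paper's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cluster_and_fuse(boxes, model_ids, model_maps, eps=30.0):
    """Rough sketch of clustering + SoftMax-weighted box fusion.
    boxes: (N, 4) [x1, y1, x2, y2] boxes pooled from all networks,
    model_ids: (N,) index of the network that produced each box,
    model_maps: per-network mAP@0.5 scores turned into SoftMax weights."""
    boxes = np.asarray(boxes, dtype=float)
    weights = softmax(np.asarray(model_maps, dtype=float))[np.asarray(model_ids)]

    # Cluster box centres so that boxes covering the same lesion are grouped together.
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2, (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    labels = DBSCAN(eps=eps, min_samples=1).fit_predict(centers)

    fused = []
    for lbl in np.unique(labels):
        m = labels == lbl
        w = weights[m] / weights[m].sum()
        fused.append((boxes[m] * w[:, None]).sum(axis=0))  # weighted average of coordinates
    return np.array(fused)

# Toy example: three networks, two of them agreeing on the same region.
boxes      = [[100, 100, 150, 150], [105, 98, 152, 149], [300, 300, 340, 350]]
model_ids  = [0, 1, 2]
model_maps = [0.91, 0.88, 0.85]
print(cluster_and_fuse(boxes, model_ids, model_maps))
```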
Figures:
Figure 1. Example of US images from dataset. (a) Benign, (b) malignant.
Figure 2. Flowchart of our proposed network: CSM-FusionNet.
Figure 3. The original US image serves as the input image, and through the application of four distinct filters combined with alpha and beta values, a total of ten augmented images are generated.
Figure 4. Lesion suspected regions, green bounding box, detected by YOLOv8.
Figure 5. Determination of bounding boxes using clustering, SM-WBF, and padding. (a) All bounding boxes detected by the ten networks. (b) Clustering of bounding boxes into four regions using DBSCAN. (c) Application of SM-WBF with SoftMax weights to the clustered regions. (d) Addition of padding to the bounding boxes in (c).
Figure 6. Method for measuring lesion detection accuracy. Among the four detected bounding boxes, at least one box has an IoU of 0.9 or higher with the ground truth, and, therefore, the detection is considered successful.
Figure 7. (a) An image with all bounding boxes detected by YOLOv8 overlaid, and the results of bounding box optimization through (b) clustering, (c) SM-WBF, and (d) padding. (e) Ground truth bounding box.
Figure 8. The lesions were classified into three classes using EfficientNet-b0.
22 pages, 52708 KiB  
Article
CSMR: A Multi-Modal Registered Dataset for Complex Scenarios
by Chenrui Li, Kun Gao, Zibo Hu, Zhijia Yang, Mingfeng Cai, Haobo Cheng and Zhenyu Zhu
Remote Sens. 2025, 17(5), 844; https://doi.org/10.3390/rs17050844 - 27 Feb 2025
Abstract
Complex scenarios pose challenges to tasks in computer vision, including image fusion, object detection, and image-to-image translation. On the one hand, complex scenarios involve fluctuating weather or lighting conditions, where even images of the same scenarios appear to be different. On the other hand, the large amount of textural detail in the given images introduces considerable interference that can conceal the useful information contained in them. An effective solution to these problems is to use the complementary details present in multi-modal images, such as visible-light and infrared images. Visible-light images contain rich textural information while infrared images contain information about the temperature. In this study, we propose a multi-modal registered dataset for complex scenarios under various environmental conditions, targeting security surveillance and the monitoring of low-slow-small targets. Our dataset contains 30,819 images, where the targets are labeled as three classes of “person”, “car”, and “drone” using Yolo format bounding boxes. We compared our dataset with those used in the literature for computer vision-related tasks, including image fusion, object detection, and image-to-image translation. The results showed that introducing complementary information through image fusion can compensate for missing details in the original images, and we also revealed the limitations of visual tasks in single-modal images with complex scenarios. Full article
(This article belongs to the Special Issue Recent Advances in Infrared Target Detection)
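Since the targets are labeled with YOLO-format bounding boxes, the helper below shows how one such normalized label line converts to pixel coordinates; the image size and the class-index mapping in the example are assumptions, not taken from the dataset.

```python
from typing import List, Tuple

def yolo_to_xyxy(label_line: str, img_w: int, img_h: int) -> Tuple[int, List[float]]:
    """Convert one line of a YOLO-format label file
    ('class cx cy w h', all normalized to [0, 1]) into pixel (x1, y1, x2, y2)."""
    cls, cx, cy, w, h = label_line.split()
    cx, cy = float(cx) * img_w, float(cy) * img_h
    w, h = float(w) * img_w, float(h) * img_h
    return int(cls), [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]

# Example annotation for a 640x512 infrared frame; class 2 might be "drone".
print(yolo_to_xyxy("2 0.50 0.25 0.10 0.08", img_w=640, img_h=512))
# (2, [288.0, 107.52, 352.0, 148.48])
```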
Figures:
Figure 1. The structure of our CSMR dataset. The original images first need to be registered and labeled. We test our dataset on three visual tasks: image fusion, object detection, and image-to-image translation.
Figure 2. Examples of the related datasets. The first row corresponds to the visible-light images and the second row corresponds to the infrared images. From left to right: (a) TNO, (b) OSU, (c) CVC-14, (d) KAIST, (e) FLIR, (f) LLVIP.
Figure 3. Our collection equipment contains a binocular camera platform and a portable computer.
Figure 4. Cameras for image collection.
Figure 5. Examples of the scenarios. The first row corresponds to the visible-light images and the second row corresponds to the infrared images. From left to right: (a) street at night, (b) intersection from top-down perspective, (c) disguised person in field, (d) waterside, (e) drone in sky.
Figure 6. Registration result of our dataset.
Figure 7. Examples of drones in our CSMR dataset. The images in the first row are visible-light images. The images in the second row are infrared images.
Figure 8. Examples of fusion algorithms on our CSMR dataset.
Figure 9. Examples of “erratic temperature”. From left to right: a single car; two different-colored cars; two same-colored cars.
Figure 10. Examples of pedestrian and car detection on our CSMR dataset.
Figure 11. Examples of failure cases.
Figure 12. Examples of drone detection on our CSMR dataset.
Figure 13. Examples of image-to-image translation results on our CSMR dataset.
25 pages, 20488 KiB  
Article
SAR Small Ship Detection Based on Enhanced YOLO Network
by Tianyue Guan, Sheng Chang, Chunle Wang and Xiaoxue Jia
Remote Sens. 2025, 17(5), 839; https://doi.org/10.3390/rs17050839 - 27 Feb 2025
Abstract
Ships are important targets for marine surveillance in both military and civilian domains. Since the rise of deep learning, ship detection in synthetic aperture radar (SAR) images has achieved significant progress. However, the variability in ship size and resolution, especially the widespread presence of numerous small-sized ships, continues to pose challenges for effective ship detection in SAR images. To address the challenges posed by small ship targets, we propose an enhanced YOLO network to improve the detection accuracy of small targets. Firstly, we propose a Shuffle Re-parameterization (SR) module as a replacement for the C2f module in the original YOLOv8 network. The SR module employs re-parameterized convolution along with channel shuffle operations to improve feature extraction capabilities. Secondly, we employ the space-to-depth (SPD) module to perform down-sampling operations within the backbone network, thereby reducing the information loss associated with pooling operations. Thirdly, we incorporate a Hybrid Attention (HA) module into the neck network to enhance the feature representation of small ship targets while mitigating the interference caused by surrounding sea clutter and speckle noise. Finally, we add the shape-NWD loss to the regression loss, which emphasizes the shape and scale of the bounding box and mitigates the sensitivity of Intersection over Union (IoU) to positional deviations in small ship targets. Extensive experiments were carried out on three publicly available datasets—namely, LS-SSDD, HRSID, and iVision-MRSSD—to demonstrate the effectiveness and reliability of the proposed method. In the small ship dataset LS-SSDD, the proposed method exhibits a notable improvement in average precision at an IoU threshold of 0.5 (AP50), surpassing the baseline network by over 4%, and achieving an AP50 of 77.2%. In the HRSID and iVision-MRSSD datasets, AP50 reaches 91% and 95%, respectively. Additionally, the average precision for small targets (AP) exhibits an increase of approximately 2% across both datasets. Furthermore, the proposed method demonstrates outstanding performance in comparison experiments across all three datasets, outperforming existing state-of-the-art target detection methods. The experimental results offer compelling evidence supporting the superior performance and practical applicability of the proposed method in SAR small ship detection. Full article
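For readers unfamiliar with the NWD term in the loss, the sketch below implements one common formulation of the Normalized Wasserstein Distance between two boxes modeled as 2-D Gaussians; the normalizing constant and the exact way it is combined with the shape term in shape-NWD are assumptions, not the paper's definition.

```python
import math

def nwd(box1, box2, c: float = 12.8) -> float:
    """Normalized Wasserstein Distance between boxes given as (cx, cy, w, h).
    Each box is modeled as a 2-D Gaussian; the squared 2-Wasserstein distance
    between the Gaussians has the closed form below, and NWD = exp(-sqrt(W2)/c).
    The constant c is dataset-dependent (the value here is an assumption)."""
    cx1, cy1, w1, h1 = box1
    cx2, cy2, w2, h2 = box2
    w2_sq = (cx1 - cx2) ** 2 + (cy1 - cy2) ** 2 + ((w1 - w2) / 2) ** 2 + ((h1 - h2) / 2) ** 2
    return math.exp(-math.sqrt(w2_sq) / c)

# A 3-pixel shift on an 8x6 ship drops plain IoU to roughly 0.45,
# while NWD stays high, which is why it is gentler on small targets.
print(nwd((50, 50, 8, 6), (53, 50, 8, 6)))  # approx. 0.79
```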
Figures:
Figure 1. The overall architecture of the proposed method.
Figure 2. The structure of re-parameterized convolution block (RepConv).
Figure 3. (a) The structure of Shuffle Re-parameterization block (SRB). (b) The structure of Shuffle Re-parameterization module (SR).
Figure 4. The structure of Space-to-Depth module. (a) is the structure of SPD layer. The character C represents the feature map’s channel numbers, and the character s denotes the width and height of the input feature map. After the SPD layer, both the height and width of the feature map are reduced by half, while the number of channels is increased fourfold. (b) is the SPD module, which consists of the SPD layer.
Figure 5. The structure of Hybrid Attention (HA). (a) is the Multi-axis External Weights module (MEW). (b) is the Spatial Attention module (SA). (c) is the Hybrid Attention (HA) module, which consists of the MEW module and the SA module.
Figure 6. Precision–recall curves when adding different modules.
Figure 7. Visualization of the detection results. (a,d,g) represent the ground truth. (b,e,h) represent the baseline. (c,f,i) represent the proposed method. Green: ground truths. Yellow: detection results. Red: missed detections. Blue: false alarms.
Figure 8. Visualization of the detection results. (a,d,g,j,m) represent the ground truth. (b,e,h,k,n) represent the baseline. (c,f,i,l,o) represent the proposed method. Green: ground truths. Yellow: detection results. Red: missed detections. Blue: false alarms.
Figure 9. Visualization of the detection results of various methods on LS-SSDD (a)–(l). (a) Ground truth. (b) Faster R-CNN. (c) CenterNet. (d) FCOS. (e) ATSS. (f) YOLOv5n. (g) YOLOv8n. (h) YOLOv10n. (i) YOLOv11n. (j) SHIP-YOLO. (k) LHSDNet. (l) The proposed method. The green boxes indicate the ground truths; the red boxes indicate detection results.
19 pages, 36008 KiB  
Article
An Enhanced Algorithm for Detecting Small Traffic Signs Using YOLOv10
by Hongrui Liu, Ke Wang, Yudi Wang, Ming Zhang, Qinghua Liu and Wentao Li
Electronics 2025, 14(5), 955; https://doi.org/10.3390/electronics14050955 - 27 Feb 2025
Abstract
Recognizing traffic signs is crucial for autonomous driving systems, as it significantly impacts their safety and dependability. However, challenges like the diminutive size of objects and intricate background environments limit the effectiveness of current object detection models. To improve small traffic sign detection, this research introduces an enhanced detection algorithm built on YOLOv10. First, a custom-designed layer for detecting small objects is integrated into the neck section of the network, enhancing the feature extraction process for these objects. Second, a refined downsampling module, called Triple-Branch Downsampling (TBD), utilizes a multi-branch structure and hybrid pooling strategy to boost feature extraction efficiency within the model. Finally, the loss function is optimized by integrating the Normalized Wasserstein Distance (NWD) and Wise-MPDIoU mechanisms, increasing the accuracy of bounding box matching and regression. The experimental findings indicate that the enhanced algorithm reaches a mAP@0.5 of 84.8%, marking a 4% increase over YOLOv10. The classification accuracy and recall are 73.4% and 82.9%, respectively. Moreover, the parameter count decreases by approximately 10%, while the computational complexity is reduced by around 5%. Full article
Figures:
Figure 1. Overall architecture of YOLOv10.
Figure 2. Overall architecture of the improved network model.
Figure 3. Small-object detection layer structure.
Figure 4. Diagram of the TBD module structure.
Figure 5. IoU sensitivity evaluation for objects at tiny and normal scales. (a) Zoomed-in view of the positional offset on a small-scale object; (b) zoomed-in view of the positional offset on a medium-scale object.
Figure 6. Traffic sign classifications included in the dataset. (a) Samples from the instruction class, (b) samples from the prohibit class, which include traffic signs in Chinese characters indicating to slow down and yield, and (c) samples from the warning class.
Figure 7. Dataset labeling information. (a) Dataset category distribution. (b) Distribution of target positions. (c) Distribution of target sizes.
Figure 8. Performance curve comparison chart.
Figure 9. Comparison of practical road environments. The left column displays the original road scene images without detection, the middle column illustrates the detection results from the YOLOv10 algorithm, and the right column highlights the outcomes identified by the improved algorithm proposed in this study.
23 pages, 26465 KiB  
Article
DHS-YOLO: Enhanced Detection of Slender Wheat Seedlings Under Dynamic Illumination Conditions
by Xuhua Dong and Jingbang Pan
Agriculture 2025, 15(5), 510; https://doi.org/10.3390/agriculture15050510 - 26 Feb 2025
Abstract
The precise identification of wheat seedlings in unmanned aerial vehicle (UAV) imagery is fundamental for implementing precision agricultural practices such as targeted pesticide application and irrigation management. This detection task presents significant technical challenges due to two inherent complexities: (1) environmental interference from variable illumination conditions and (2) morphological characteristics of wheat seedlings characterized by slender leaf structures and flexible posture variations. To address these challenges, we propose DHS-YOLO, a novel deep learning framework optimized for robust wheat seedling detection under diverse illumination intensities. Our methodology builds upon the YOLOv11 architecture with three principal enhancements: First, the Dynamic Slender Convolution (DSC) module employs deformable convolutions to adaptively capture the elongated morphological features of wheat leaves. Second, the Histogram Transformer (HT) module integrates a dynamic-range spatial attention mechanism to mitigate illumination-induced image degradation. Third, we implement the ShapeIoU loss function that prioritizes geometric consistency between predicted and ground truth bounding boxes, particularly optimizing for slender plant structures. The experimental validation was conducted using a custom UAV-captured dataset containing wheat seedling images under varying illumination conditions. Compared to the existing models, the proposed model achieved the best performance with precision, recall, mAP50, and mAP50-95 values of 94.1%, 91.0%, 95.2%, and 81.9%, respectively. These results demonstrate our model’s effectiveness in overcoming illumination variations while maintaining high sensitivity to fine plant structures. This research contributes an optimized computer vision solution for precision agriculture applications, particularly enabling automated field management systems through reliable crop detection in challenging environmental conditions. Full article
(This article belongs to the Special Issue Computational, AI and IT Solutions Helping Agriculture)
Figures:
Figure 1. Wheat seedling dataset at the six different illumination intensities (II), i.e., (a) II level #1 (II ≤ 0.2k Lux), (b) II level #2 (0.2k Lux < II ≤ 0.5k Lux), (c) II level #3 (0.5k Lux < II ≤ 10k Lux), (d) II level #4 (10k Lux < II ≤ 40k Lux), (e) II level #5 (40k Lux < II ≤ 100k Lux), and (f) II level #6-Shadow (10k < II ≤ 100k, with over 50% shadow area), where Lux is the universal unit of II and k means 1000. The image data in (a,b) were collected when the sky was overcast, while (c) was collected when it was cloudy. The image data in (d–f) were collected on a sunny day, with a lot of shadows in (f).
Figure 2. Overall structure of DHS-YOLO.
Figure 3. Dynamic Slender Convolution (DSC) module, which learns the deformation according to the input feature map and adaptively focuses on the slender local features of the wheat leaves under the knowledge of the slender structure morphology.
Figure 4. DSC block.
Figure 5. Architecture of our Histogram Transformer module for illumination removal. The main components are the Dynamic-range Histogram Self-Attention (DHSA) part and the Dual-scale Gated Feed-Forward (DGFF) part. There are two types of reshaping mechanisms in DHSA: Bin-wise Histogram Reshaping (BHR) and Frequency-wise Histogram Reshaping (FHR).
Figure 6. HT block.
Figure 7. Graphical representation for the introduction of ShapeIoU loss.
Figure 8. Heatmap of YOLOv11 with different modules in different illumination intensities. From left to right, the columns represent the original images, heatmap of original YOLOv11, heatmap of YOLOv11 with DSC modules, heatmap of YOLOv11 with HT modules, and heatmap of our model (YOLOv11 with DSC and HT modules), respectively. From top to bottom, the rows correspond to illumination intensities of II level #1, #2, #3, #4, and #5, respectively. The heatmap uses a rainbow color coding scheme, with a gradient of color bands from blue (low active degree value) to red (high active degree value).
Figure 9. Comparison of wheat seedling detection results of different models using the prediction image dataset under six different illumination conditions. From left to right, the columns represent the GT (real bounding box of wheat seedlings), the detection results of YOLOv11, WAS-YOLO, and our proposed algorithm, respectively. From top to bottom, the rows represent the six different illumination conditions in Table 1. The blue boxes denote the detection results, the detection count is displayed in the bottom left corner of the image (e.g., DetNum: 24), and regions with markers (A, B, and C) indicate incorrect detection areas. Marker A means over-detection, marker B represents missed detection, and marker C means an incorrect detection box with low IoU.
Figure 10. Comparison of wheat seedling detection results of different models under four different density conditions. From left to right, the columns represent the GT (real bounding box of wheat seedlings), the detection results of YOLOv11, Deformable DETR, and our proposed algorithm, respectively. From top to bottom, the rows represent the four different density conditions of wheat seedling distribution (density levels #1, #2, #3, and #4). The blue boxes denote the detection results, the detection count is displayed in the bottom left corner of the image (e.g., DetNum: 38), and regions with markers (A, B, and C) indicate incorrect detection areas. Marker A means over-detection, marker B represents missed detection, and marker C means an incorrect detection box with low IoU.
26 pages, 13085 KiB  
Article
Image Augmentation Approaches for Building Dimension Estimation in Street View Images Using Object Detection and Instance Segmentation Based on Deep Learning
by Dongjin Hwang, Jae-Jun Kim, Sungkon Moon and Seunghyeon Wang
Appl. Sci. 2025, 15(5), 2525; https://doi.org/10.3390/app15052525 - 26 Feb 2025
Abstract
There are numerous applications for building dimension data, including building performance simulation and urban heat island investigations. In this context, object detection and instance segmentation methods—based on deep learning—are often used with Street View Images (SVIs) to estimate building dimensions. However, these methods typically depend on large and diverse datasets. Image augmentation can artificially boost dataset diversity, yet its role in building dimension estimation from SVIs remains under-studied. This research presents a methodology that applies eight distinct augmentation techniques—brightness, contrast, perspective, rotation, scale, shearing, translation augmentation, and a combined “sum of all” approach—to train models in two tasks: object detection with Faster Region-Based Convolutional Neural Networks (Faster R-CNNs) and instance segmentation with You Only Look Once (YOLO)v10. Comparing the performance with and without augmentation revealed that contrast augmentation consistently provided the greatest improvement in both bounding-box detection and instance segmentation. Using all augmentations at once rarely outperformed the single most effective method, and sometimes degraded the accuracy; shearing augmentation ranked as the second-best approach. Notably, the validation and test findings were closely aligned. These results, alongside the potential applications and the method’s current limitations, underscore the importance of carefully selected augmentations for reliable building dimension estimation. Full article
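Because contrast augmentation turned out to be the most effective single technique, a minimal photometric-augmentation sketch using Pillow is shown below. Brightness and contrast changes leave the bounding-box and mask labels untouched, whereas the geometric augmentations studied in the paper (rotation, shear, perspective, scale, translation) would also require transforming the annotations; the enhancement factors here are illustrative, not the paper's settings.

```python
from PIL import Image, ImageEnhance

def augment_contrast_brightness(path: str, contrast: float = 1.5, brightness: float = 1.2) -> Image.Image:
    """Photometric augmentation of a street-view image. Factors > 1 increase
    contrast/brightness; factors < 1 decrease them. Labels need no update."""
    img = Image.open(path).convert("RGB")
    img = ImageEnhance.Contrast(img).enhance(contrast)
    img = ImageEnhance.Brightness(img).enhance(brightness)
    return img

# Write an augmented copy next to the original (file names are hypothetical).
augment_contrast_brightness("street_view.jpg").save("street_view_contrast.jpg")
```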
Figures:
Figure 1. Workflow of proposed methods.
Figure 2. Examples of images with brightness augmentation.
Figure 3. Examples of images with contrast augmentation.
Figure 4. Examples of images with scale augmentation.
Figure 5. Examples of images with perspective augmentation.
Figure 6. Examples of images with rotation augmentation.
Figure 7. Examples of images with translation augmentation.
Figure 8. Examples of images with shearing augmentation.
Figure 9. Workflow of Faster R-CNN.
Figure 10. Workflow of YOLOv10.
Figure 11. Geographic scope of data collection area.
Figure 12. Example images in London.
Figure 13. Examples of unusable images.
Figure 14. Examples of augmented images.
Figure 15. Examples of labeling for object detection and instance segmentation.
Figure 16. Difference between each technique and baseline (AP50).
Figure 17. Difference between each technique and baseline (AP50:95).
Figure 18. Difference between each technique and baseline (IOU).
Figure 19. Validation results for each method.
Figure 20. Results of testing validation and test sets in best model.
Figure 21. Examples of visual results from Grad-CAM analysis.
22 pages, 2410 KiB  
Article
DAHD-YOLO: A New High Robustness and Real-Time Method for Smoking Detection
by Jianfei Zhang and Chengwei Jiang
Sensors 2025, 25(5), 1433; https://doi.org/10.3390/s25051433 - 26 Feb 2025
Abstract
Recent advancements in AI technologies have driven the extensive adoption of deep learning architectures for recognizing human behavioral patterns. However, the existing smoking behavior detection models based on object detection still have problems, including poor accuracy and insufficient real-time performance. Especially in complex environments, the existing models often struggle with erroneous detections and missed detections. In this paper, we introduce DAHD-YOLO, a model built upon the foundation of YOLOv8. We first designed the DBCA module to replace the bottleneck component in the backbone. The architecture integrates a diverse branch block and a contextual anchor mechanism, effectively improving the backbone network’s ability to extract features. Subsequently, at the end of the backbone, we introduce adaptive fine-grained channel attention (AFGCA) to effectively facilitate the fusion of both overarching patterns and localized details. We introduce the ECA-FPN, an improved version of the feature pyramid network, designed to refine the extraction of hierarchical information and enhance cross-scale feature interactions. The decoupled detection head is also updated via the reparameterization approach. The wise–powerful intersection over union (Wise-PIoU) is adopted as the new bounding box regression loss function, resulting in quicker convergence speed and improved detection outcomes. Our system achieves superior results compared to existing models using a self-constructed smoking detection dataset, reducing computational complexity by 23.20% while trimming the model parameters by 33.95%. Moreover, the mAP50 of our model has increased by 5.1% compared to the benchmark model, reaching 86.0%. Finally, we deploy the improved model on the RK3588. After optimizations such as quantization and multi-threading, the system achieves a detection rate of 50.2 fps, addressing practical application demands and facilitating the precise and instantaneous identification of smoking activities. Full article
Figures:
Figure 1. The architecture of DAHD-YOLO.
Figure 2. The component diagram of the DBCA module, including two sub-diagrams, namely ConvDBB and CAA.
Figure 3. Diagram of AFGCA.
Figure 4. Diagram of ECA-FPN.
Figure 5. Regression results guided by different BBR losses.
Figure 6. Comparison of feature visualization after adding different attention mechanisms in FPN.
Figure 7. Comparison of convergence speeds between Wise-PIoU and traditional loss functions.
Figure 8. Comparison of heatmap effects: (a) YOLOv8 base model; (b) improved model with AFGCA.
Figure 9. Comparison of the detection effects of the original model and the improved model in complex scenarios.
22 pages, 3970 KiB  
Article
YOLO-ALW: An Enhanced High-Precision Model for Chili Maturity Detection
by Yi Wang, Cheng Ouyang, Hao Peng, Jingtao Deng, Lin Yang, Hailin Chen, Yahui Luo and Ping Jiang
Sensors 2025, 25(5), 1405; https://doi.org/10.3390/s25051405 - 25 Feb 2025
Abstract
Chili pepper, a widely cultivated and consumed crop, faces challenges in accurately determining maturity due to issues such as occlusion, small target size, and similarity between fruit color and background. This study presents an enhanced YOLOv8n-based object detection model, YOLO-ALW, designed to address these challenges. The model introduces the AKConv (Alterable Kernel Convolution) module in the head section, which adaptively adjusts the convolution kernel shape and size based on the target and scene, improving detection performance under occlusion and dense environments. In the backbone, the SPPF_LSKA (Spatial Pyramid Pooling Fast-Large Separable Kernel Attention) module enhances the integration of multi-scale features, facilitating accurate differentiation of peppers at various maturity stages while maintaining low computational complexity. Additionally, the Wise-IoU (Wise Intersection over Union) loss function optimizes bounding box learning, further improving the detection of peppers in occluded or background-similar scenarios. Experimental results demonstrate that YOLO-ALW achieves a mean average precision (mAP0.5) of 99.1%, with precision and recall rates of 98.3% and 97.8%, respectively, outperforming the original YOLOv8n by 3.4%, 5.1%, and 9.0%, respectively. Grad-CAM feature visualization highlights the model’s improved focus on key fruit features. YOLO-ALW shows significant promise for high-precision chili pepper detection and maturity recognition, offering valuable support for automated harvesting applications. Full article
(This article belongs to the Section Smart Agriculture)
Figures:
Figure 1. Chili pepper images under different shooting angles and weather conditions.
Figure 2. Categorization of maturity.
Figure 3. Data presentation in different environmental conditions.
Figure 4. The network structure of YOLO-ALW.
Figure 5. Schematic diagram of the AKConv module.
Figure 6. Structural diagram of the LSKA module.
Figure 7. Structural diagram of the SPPF_LSKA module.
Figure 8. Model training and validation metrics.
Figure 9. Comparison of model detection results.
Figure 10. Comparison of model heatmap results.
33 pages, 23014 KiB  
Article
Underwater Target Tracking Method Based on Forward-Looking Sonar Data
by Wenjing Zeng, Renzhe Li, Heng Zhou and Tiedong Zhang
J. Mar. Sci. Eng. 2025, 13(3), 430; https://doi.org/10.3390/jmse13030430 - 25 Feb 2025
Abstract
Underwater dynamic targets often display significant blurriness in their forward-looking sonar imagery, accompanied by sparse feature representation. This phenomenon presents several challenges, including disturbances in the trajectories of underwater targets and alterations in target identification throughout the tracking process, thereby complicating the continuous monitoring of moving targets. This research proposes a new framework for underwater acoustic data interpolation and underwater object tracking. Considering the character of underwater acoustic images, a Swin Transformer is integrated in the architecture of the YOLOv5 network; then, an improved Deep Simple Online and Real-time Tracking method is developed. By enlarging the bounding box output generated by the detector and subsequently integrating it into the tracker, the sensing horizon of the tracker is broadened. This strategy enables the extraction of noise features surrounding the target, thereby augmenting the target’s characteristics and improving the stability of the tracking process. The experimental results demonstrate that the proposed method effectively reduces the frequency of changes in target identification numbers, minimizes the occurrence of trajectory interruptions, and decreases the overall percentage of trajectory interruptions. Additionally, it significantly enhances tracking stability, particularly in scenarios involving intersecting target paths and encounters. Full article
(This article belongs to the Section Ocean Engineering)
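The key tracker modification described above is enlarging the detector's bounding box before it enters the DeepSORT appearance branch, so the crop also carries the noise context around the target. A minimal sketch of such an expansion step is given below; the scale factor and the clamping behavior are assumptions for illustration, not the authors' exact settings.

```python
def expand_box(box, scale: float, img_w: int, img_h: int):
    """Enlarge a detector bounding box (x1, y1, x2, y2) about its centre
    before handing it to the tracker, clamped to the sonar image bounds."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    return (
        max(0.0, cx - w / 2),
        max(0.0, cy - h / 2),
        min(float(img_w), cx + w / 2),
        min(float(img_h), cy + h / 2),
    )

# A 40x30 detection expanded by 1.5x so the appearance feature also covers
# the noisy background around the target.
print(expand_box((100, 80, 140, 110), scale=1.5, img_w=512, img_h=400))
# (90.0, 72.5, 150.0, 117.5)
```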
Show Figures

Figure 1

Figure 1
<p>Diagram showing scanning procedure [<a href="#B44-jmse-13-00430" class="html-bibr">44</a>].</p>
Full article ">Figure 2
<p>Acoustic image of target.</p>
Full article ">Figure 3
<p>(<b>a</b>) STr architecture [<a href="#B45-jmse-13-00430" class="html-bibr">45</a>]. (<b>b</b>) STr block [<a href="#B45-jmse-13-00430" class="html-bibr">45</a>].</p>
Full article ">Figure 4
<p>YOLOv5 architecture.</p>
Full article ">Figure 5
<p>YOLOv5 incorporating STr block.</p>
Full article ">Figure 6
<p>Confusion matrix results. (<b>a</b>) Results obtained by Y5s; (<b>b</b>) results obtained by Y5S1; (<b>c</b>) results obtained by Y5S2; (<b>d</b>) results obtained by Y5S3.</p>
Full article ">Figure 7
<p>Results obtained by Y5s.</p>
Full article ">Figure 8
<p>Results obtained by Y5S3.</p>
Full article ">Figure 9
<p>Heatmaps.</p>
Full article ">Figure 10
<p>DeepSORT tracing framework.</p>
Full article ">Figure 11
<p>The extended target frame of the DeepSORT structure.</p>
Full article ">Figure 12
<p>Original target box (solid line) and expanded target box (dashed line) in sonar imagery.</p>
Full article ">Figure 13
<p>Comparison of classification accuracy. (<b>a</b>) Results of the ResNet50 network; (<b>b</b>) results of the MobileNet network.</p>
Full article ">Figure 14
<p>Data collection environment.</p>
Full article ">Figure 15
<p>Five types of targets and their sonar images.</p>
Full article ">Figure 16
<p>Underwater object trajectory.</p>
Full article ">Figure 17
<p>Layout of test waters and object placement. (<b>a</b>) Movement trajectory; (<b>b</b>) object placement.</p>
Full article ">Figure 18
<p>Sonar image sequence of targets in lake. (<b>a</b>) The 30th image; (<b>b</b>) the 90th image; (<b>c</b>) the 130th image; (<b>d</b>) the 170th image.</p>
Full article ">Figure 19
<p>Sonar-simulated image (sphere).</p>
Full article ">Figure 20
<p>Comparison results of different targets in genuine acoustic images (<b>left</b>), simulated images (<b>middle</b>), and generated images (<b>right</b>). (<b>a</b>) Sphere target; (<b>b</b>) dummy model; (<b>c</b>) cylinder target; (<b>d</b>) tire target.</p>
Full article ">Figure 21
<p>Sonar image sequences generated by Pix2PixHD net. (<b>a</b>) Movement trajectory of object in the simulated image; (<b>b</b>) movement trajectory of object in the generated image.</p>
Full article ">Figure 22
<p>Object trajectory based on expanded data.</p>
Full article ">Figure 23
<p>Image sequences of intersection trajectory (spherical). (<b>a</b>) The 10th image; (<b>b</b>) the 60th image; (<b>c</b>) the 90th image; (<b>d</b>) the 100th image; (<b>e</b>) the 130th image; (<b>f</b>) the 180th image.</p>
Full article ">Figure 24
<p>Image sequences of the straight line-crossing trajectory (spherical). (<b>a</b>) The 10th image; (<b>b</b>) the 60th image; (<b>c</b>) the 90th image; (<b>d</b>) the 100th image; (<b>e</b>) the 130th image; (<b>f</b>) the 180th image.</p>
Full article ">Figure 25
<p>An overview of indexes used to assess target tracking performance. (<b>a</b>) ID Switch in the case of trajectory interruption (ID Switch = 1, Frag Ratio = 0.2); (<b>b</b>) ID Switch in the case of trajectory interruption (ID Switch = 1, Frag Ratio = 0.0); (<b>c</b>) trajectory interruption without ID Switch (Frag Ratio = 0.2); (<b>d</b>) trajectory interruption without ID Switch (Frag Ratio = 0.4).</p>
Full article ">Figure 26
<p>Tracking results of the dummy model. (<b>a</b>) The target motion trajectory; (<b>b</b>) results obtained by the SORT method; (<b>c</b>) results obtained by the DeepSORT method; (<b>d</b>) results obtained by the ExDeepSORT method.</p>
Full article ">Figure 26 Cont.
<p>Tracking results of the dummy model. (<b>a</b>) The target motion trajectory; (<b>b</b>) results obtained by the SORT method; (<b>c</b>) results obtained by the DeepSORT method; (<b>d</b>) results obtained by the ExDeepSORT method.</p>
Full article ">Figure 27
<p>Tracking results of a dummy model by ExDeepSORT. (<b>a</b>) The 60th image; (<b>b</b>) the 90th image; (<b>c</b>) the 150th image; (<b>d</b>) the 180th image.</p>
Full article ">Figure 28
<p>Tracking results of the dummy model. (<b>a</b>) The target motion trajectory; (<b>b</b>) results obtained by the SORT method; (<b>c</b>) results obtained by the DeepSORT method; (<b>d</b>) results obtained by the ExDeepSORT method.</p>
Full article ">Figure 29
<p>Tracking results of the dummy model. (<b>a</b>) The 60th image; (<b>b</b>) the 90th image; (<b>c</b>) the 150th image; (<b>d</b>) the 180th image.</p>
Full article ">Figure 30
<p>Tracking results of the diver. (<b>a</b>) The target motion trajectory; (<b>b</b>) results obtained by the SORT method; (<b>c</b>) results obtained by the DeepSORT method; (<b>d</b>) results obtained by the ExDeepSORT method.</p>
Full article ">Figure 31
<p>Tracking results of single diver in a pool using ExDeepSORT. (<b>a</b>) The 70th image; (<b>b</b>) the 150th image; (<b>c</b>) the 200th image; (<b>d</b>) the 250th image.</p>
Figure 31 Cont.">
Full article ">Figure 32
<p>Tracking results based on sea trial data. (<b>a</b>) Target motion trajectory; (<b>b</b>) results obtained by SORT method; (<b>c</b>) results obtained by DeepSORT method; (<b>d</b>) results obtained by ExDeepSORT method.</p>
Full article ">Figure 33
<p>Tracking results obtained by ExDeepSORT method. (<b>a</b>) The 50th image; (<b>b</b>) the 100th image; (<b>c</b>) the 150th image; (<b>d</b>) the 200th image.</p>
Full article ">
15 pages, 5230 KiB  
Article
Vehicle Exhaust Estimation Using YOLOv7 and Support Vector Regression with Image Features
by Yun-Sin Lin, Ting-Yu Chen, Jiun-Jian Liaw, Hsi-Hsien Yang and Cheng-Hsiung Hsieh
Information 2025, 16(3), 168; https://doi.org/10.3390/info16030168 - 24 Feb 2025
Viewed by 207
Abstract
Vehicle exhaust is a major source of air pollution that contributes to environmental degradation and poses risks to public health. This paper presents an image-based method to estimate opacity (OP) and particulate matter (PM) from vehicle exhaust. In the proposed method, YOLOv7 was [...] Read more.
Vehicle exhaust is a major source of air pollution that contributes to environmental degradation and poses risks to public health. This paper presents an image-based method for estimating opacity (OP) and particulate matter (PM) from vehicle exhaust. In the proposed method, YOLOv7 was used to identify vehicles and, thus, the region of interest (ROI). A support vector regression model was then trained using four image features extracted from the ROI as inputs and OP or PM as the output. The method was verified by experiments covering moving and static scenarios with three ROIs: the exhaust pipe area (EPA), the vehicle bounding box (VBB), and a white background (WBG). In the moving scenario, the EPA and VBB ROIs were considered. For the VBB ROI, the average R² values for OP and PM were 0.834 and 0.894, respectively; for the EPA ROI, they were 0.838 and 0.910. In the static scenario, the EPA and WBG ROIs were considered. For the EPA ROI, the average R² values for OP and PM were 0.619 and 0.612, respectively; for the WBG ROI, they were 0.748 and 0.732. The results suggest that the EPA ROI is preferable in the moving scenario and the WBG ROI in the static scenario for estimating OP and PM from vehicle exhaust. The satisfactory R² values indicate that the proposed method is a promising approach to estimating OP and PM from vehicle exhaust. Full article
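Editor's note: the abstract describes a pipeline in which YOLOv7 localizes the ROI and a support vector regression maps four image features to OP or PM. The following is a minimal sketch of the regression stage with scikit-learn; the synthetic feature values, kernel choice, and hyperparameters are placeholders and not the authors' settings.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

# Each row stands in for [f_se, f_dc, f_en, f_rms] extracted from one ROI frame.
# The values below are random placeholders, not real measurements.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y_opacity = X @ np.array([0.4, 0.3, 0.2, 0.1]) + 0.05 * rng.standard_normal(200)

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.01))
model.fit(X[:150], y_opacity[:150])          # train on the first 150 frames
print("R^2:", r2_score(y_opacity[150:], model.predict(X[150:])))
```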
(This article belongs to the Special Issue Deep Learning for Image, Video and Signal Processing)
Show Figures

Figure 1

Figure 1
<p>The VBB, EPA, and WBG ROIs: (<b>a</b>) VBB (red lines) and EPA (blue lines) for the moving scenario; (<b>b</b>) EPA (blue lines) and (<b>c</b>) WBG (green lines) for the static scenario.</p>
Full article ">Figure 2
<p>The Sobel edge feature <span class="html-italic">f<sub>se</sub></span> and the corresponding (<b>a</b>) OP and (<b>b</b>) PM.</p>
Full article ">Figure 3
<p>The dark channel feature <span class="html-italic">f<sub>dc</sub></span> and the corresponding (<b>a</b>) OP and (<b>b</b>) PM.</p>
Full article ">Figure 4
<p>The entropy feature <span class="html-italic">f<sub>en</sub></span> and the corresponding (<b>a</b>) OP and (<b>b</b>) PM.</p>
Full article ">Figure 5
<p>The feature <span class="html-italic">f<sub>rms</sub></span> and the corresponding (<b>a</b>) OP and (<b>b</b>) PM.</p>
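Editor's note: Figures 2–5 relate four ROI features (Sobel edge strength, dark channel, entropy, and an RMS-based feature) to the measured OP and PM. The sketch below shows one plausible way to compute such features with OpenCV and NumPy; the patch sizes, normalization, and the exact RMS definition are assumptions, since the paper's formulas are not reproduced here.

```python
import cv2
import numpy as np

def roi_features(roi_bgr):
    """Compute illustrative versions of the four ROI features.

    Window sizes, normalisation, and the RMS definition are assumptions;
    the paper's precise feature formulas may differ.
    """
    gray = cv2.cvtColor(roi_bgr, cv2.COLOR_BGR2GRAY)

    # Sobel edge strength, averaged over the ROI.
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    f_se = float(np.mean(np.hypot(gx, gy)))

    # Dark channel: per-pixel minimum over B, G, R, then a local minimum filter.
    dark = cv2.erode(roi_bgr.min(axis=2), np.ones((15, 15), np.uint8))
    f_dc = float(np.mean(dark))

    # Shannon entropy of the grey-level histogram.
    hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).ravel()
    p = hist / hist.sum()
    f_en = float(-np.sum(p[p > 0] * np.log2(p[p > 0])))

    # Root-mean-square intensity.
    f_rms = float(np.sqrt(np.mean(gray.astype(np.float64) ** 2)))
    return f_se, f_dc, f_en, f_rms

# Example on a uniform grey patch.
print(roi_features(np.full((64, 64, 3), 128, np.uint8)))
```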
Full article ">Figure 6
<p>Block diagrams of the proposed method: (<b>a</b>) the training stage; (<b>b</b>) the testing stage.</p>
Full article ">Figure 7
<p>Data collection setup: (<b>a</b>) the moving scenario; (<b>b</b>) the static scenario.</p>
Full article ">Figure 8
<p>The OP and PM measurements in the moving scenario: (<b>a</b>) Video 1; (<b>b</b>) Video 2; (<b>c</b>) Video 3.</p>
Full article ">Figure 9
<p>The OP and PM measurements in the static scenario: (<b>a</b>) Videos 4 and 8; (<b>b</b>) Videos 5 and 9; (<b>c</b>) Videos 6 and 10; (<b>d</b>) Videos 7 and 11.</p>
Full article ">Figure 10
<p>Scatter plots for Video 1 for the EPA ROI in the moving scenario: (<b>a</b>) PM estimation (<span class="html-italic">R</span><sup>2</sup> = 0.937); (<b>b</b>) OP estimation (<span class="html-italic">R</span><sup>2</sup> = 0.919).</p>
Full article ">Figure 11
<p>Scatter plots for Video 7 for the WBG ROI in the static scenario: (<b>a</b>) PM estimation (<span class="html-italic">R</span><sup>2</sup> = 0.755); (<b>b</b>) OP estimation (<span class="html-italic">R</span><sup>2</sup> = 0.848).</p>
Full article ">
27 pages, 65983 KiB  
Article
Automatic Prompt Generation Using Class Activation Maps for Foundational Models: A Polyp Segmentation Case Study
by Hanna Borgli, Håkon Kvale Stensland and Pål Halvorsen
Mach. Learn. Knowl. Extr. 2025, 7(1), 22; https://doi.org/10.3390/make7010022 - 24 Feb 2025
Viewed by 292
Abstract
We introduce a weakly supervised segmentation approach that leverages class activation maps and the Segment Anything Model to generate high-quality masks using only classification data. A pre-trained classifier produces class activation maps that, once thresholded, yield bounding boxes encapsulating the regions of interest. [...] Read more.
We introduce a weakly supervised segmentation approach that leverages class activation maps (CAMs) and the Segment Anything Model (SAM) to generate high-quality masks using only classification data. A pre-trained classifier produces CAMs that, once thresholded, yield bounding boxes encapsulating the regions of interest. These boxes prompt SAM to generate detailed segmentation masks, which are then refined by selecting, via the intersection over union (IoU) metric, the best overlap with the masks produced automatically by the foundational model. In a polyp segmentation case study, our approach outperforms existing zero-shot and weakly supervised methods, achieving a mean IoU of 0.63. The method offers an efficient and general solution for image segmentation tasks where segmentation data are scarce. Full article
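Editor's note: as a rough illustration of the thresholding step described in the abstract, the sketch below converts a normalized CAM into a box prompt. The 0.6 threshold mirrors the 50–60% range reported later for polyps, but the function name, fallback behavior, and synthetic input are illustrative assumptions.

```python
import numpy as np

def cam_to_box(cam, threshold=0.6):
    """Turn a normalised CAM (H, W, values in [0, 1]) into an (x1, y1, x2, y2) box.

    The threshold is a tunable hyperparameter; 0.6 is only an assumed default.
    """
    ys, xs = np.where(cam >= threshold * cam.max())
    if ys.size == 0:
        return None                        # CAM never reaches the threshold
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

cam = np.zeros((256, 256))
cam[80:140, 100:180] = 1.0                 # synthetic activation blob
print(cam_to_box(cam))                     # (100, 80, 179, 139)
```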
(This article belongs to the Section Data)
Show Figures

Graphical abstract

Graphical abstract
Full article ">Figure 1
<p>System flow chart: We first use a trained image classifier to generate a CAM for an input image. A bounding box is then extracted from the heatmap using a threshold that acts as a tuning hyperparameter. From our experiments, a threshold of 50% to 60% works for polyps. Next, we apply MobileSAM’s predictor class [<a href="#B18-make-07-00022" class="html-bibr">18</a>], which is faster than standard SAM’s, to generate an initial mask using the bounding box as a prompt. Then, SAM’s automatic mask generator creates a list of object masks. We filter out masks too large or located at the image corners and compare the remaining masks with the initial mask, selecting the one with the best IoU overlap, requiring at least 0.1 IoU. If no suitable overlap is found, we use the initial mask. In the chart, each mask in the list of filtered masks is shown with its own color.</p>
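Editor's note: Figure 1 describes filtering SAM's automatically generated masks and keeping the one that best overlaps the box-prompted mask, with a fallback when no candidate reaches 0.1 IoU. A minimal sketch of that selection step follows; the single size filter used here is a simplified stand-in for the paper's size and corner-location criteria.

```python
import numpy as np

def iou(a, b):
    """IoU of two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def select_mask(box_mask, auto_masks, img_area, min_iou=0.1, max_area_frac=0.5):
    """Pick the automatic mask that best overlaps the box-prompted mask.

    Masks covering more than `max_area_frac` of the image are discarded
    (a simplified stand-in for the paper's size/corner filtering); if no
    candidate reaches `min_iou`, the box-prompted mask is kept.
    """
    best, best_iou = None, 0.0
    for m in auto_masks:
        if m.sum() > max_area_frac * img_area:
            continue                       # too large, likely background
        score = iou(box_mask, m)
        if score > best_iou:
            best, best_iou = m, score
    return best if best is not None and best_iou >= min_iou else box_mask

# Tiny example with 8x8 masks.
box_mask = np.zeros((8, 8), bool); box_mask[2:6, 2:6] = True
cand = np.zeros((8, 8), bool); cand[3:6, 2:6] = True
print(select_mask(box_mask, [cand], img_area=64).sum())
```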
Full article ">Figure 2
<p>The figure shows a subset of images from the Kvasir-SEG dataset, shown without their corresponding ground truth masks. Images can come from both gastroscopic and endoscopic examinations, and we do not know where in the body the polyps are located. The images span a wide range of resolutions and have been processed to blacken out the box used for orienting the camera in the body; however, text artifacts remain in some images.</p>
Full article ">Figure 3
<p>The figure shows a subset of images from the colon polyps class from the GastroVision dataset. We use this class to classify the polyps in Kvasir-SEG regardless of where the polyp is located in the body. The images are distributed among seven different resolutions and have different artifacts, such as text and boxes for orienting the camera in the body.</p>
Full article ">Figure 4
<p>The graph shows mIoU, Dice, precision, and recall scores for bounding box thresholds increasing from 10% to 90%. The method used is the highest-scoring CAM method, ScoreCAM, with augmentation smoothing enabled. The graph shows that the best results are achieved at a bounding box threshold of around 60%.</p>
Full article ">Figure 5
<p>We show four graphs plotting mIoU for different CAM methods over bounding box thresholds increasing in 10% steps from 10% to 90%. Each graph shows the methods under a different combination of enabled and disabled smoothing techniques. The graphs show that both the choice of CAM method and the smoothing techniques strongly affect the results. Augmentation smoothing increases the results only marginally, while eigen smoothing shifts the best bounding box threshold and slightly decreases mIoU. Combining the two techniques has a similar effect to using eigen smoothing alone.</p>
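Editor's note: Figure 5 compares CAM variants with augmentation and eigen smoothing toggled on and off. Assuming an implementation along the lines of the widely used pytorch-grad-cam package (the paper's exact tooling is not stated here), these options would be toggled roughly as sketched below; the backbone model, target layer, and class index are placeholders.

```python
import torch
from torchvision.models import densenet121
from pytorch_grad_cam import ScoreCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

# Placeholder classifier; the paper trains its own polyp classifier.
model = densenet121(weights="IMAGENET1K_V1").eval()
cam = ScoreCAM(model=model, target_layers=[model.features[-1]])

input_tensor = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed image
targets = [ClassifierOutputTarget(0)]        # force a chosen class index

# aug_smooth averages CAMs over test-time augmentations; eigen_smooth
# projects the activations onto their first principal component.
heatmap = cam(input_tensor=input_tensor, targets=targets,
              aug_smooth=True, eigen_smooth=False)
print(heatmap.shape)                          # (1, 224, 224)
```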
Full article ">Figure 6
<p>The figure shows eight results from experiment six, where we use the full method. Each row of images shows the mask generated from the bounding box extracted from the CAM; the automatically generated masks from SAM, where each color represents one mask; and the best mask, which is the automatically generated mask with the highest IoU overlap with the bounding box-generated mask, or the bounding box-generated mask itself if no automatic mask exceeds an IoU of 0.1. Finally, we see the ground truth, against which the best mask can be compared.</p>
Full article ">Figure 7
<p>A set of six edge cases from the bounding box mask generation method when running ScoreCAM with augmentation smoothing. Each case shows the ground truth, the generated CAM, and the bounding box with the mask generated by SAM. Cases 1 and 6 show masks that overlap completely, but because the CAM activates on only part of the polyp, the bounding box captures only part of it. In case 2, the classifier predicts the correct label, but the CAM is inaccurate and too large, causing the bounding box to miss the polyp. In cases 3 and 4, the classifier fails to predict the correct label, and the CAM misses the polyp completely. In case 5, one of the polyps in the image is classified correctly with an accurate CAM, but the second polyp is missed entirely, and SAM fails to segment the detected polyp correctly.</p>
Full article ">Figure 8
<p>The scatter plot shows the calculated ROAD combined score and the IoU between the generated mask and the ground truth mask for each image in the Kvasir-SEG dataset. A high ROAD score means the CAM represents the object, whereas a low score means the CAM does not significantly impact the classifier’s confidence. The mask generation method was the best result from the Bayesian optimization experiment, namely ScoreCAM with augmentation smoothing and a bounding box threshold of 0.62. The plot shows a correlation between higher ROAD combined scores and higher IoU, but ROAD combined scores are also widely spread among images with low IoU.</p>
Full article ">Figure 9
<p>The figure shows visualizations picked from two different edge cases when calculating the ROAD combined score. Each column shows a step in the process. In the column “SAM Automatic Mask Generator”, each mask is denoted by a separate color. In the column “Best Mask”, pink means we use the best overlapping mask and blue means we use the bounding box-generated mask. The first case is when we have a low IoU score but a high ROAD combined score. This indicates that the CAM activates on parts of the image that are important for classification, so we would expect the IoU to be high as well. However, we see from the first row that the CAM activates on only the central part of the polyp, while the polyp is actually huge. The second row shows a case where there are two polyps in the image: the smaller one is found, but the second, larger one is not recognized. The third row shows a case where the CAM covers the polyp, but SAM fails to segment it correctly. The second case is when we have a low ROAD combined score but a high IoU score. In the first two rows of the second case, the CAM covers the object well, and we obtain a good mask. However, we obtain a low ROAD score because the classifier has a low confidence score for polyps. We force the activation of the polyp class for the CAM, but the confidence score is almost zero. When the classifier’s baseline confidence is very low, even accurate CAM regions cause only a small drop in output upon perturbation. This minimal change leads to a low ROAD score because the metric’s sensitivity is reduced when the overall class score is low. This is a case where the ROAD combined score does not correctly reflect the performance of the CAM. In the final row, the classifier labels the image as containing a colon polyp, but the polyp is so large that the first CAM does not cover it properly. Removing the CAM region allows the classifier to see the parts not covered by the CAM, so confidence does not drop, and the ROAD score is low.</p>
Full article ">Figure 10
<p>A selection of visually evaluated good results obtained using either the filtering method (shown with pink masks) or the bounding box method (shown with blue masks and boxes), both of which greatly improved results over the zero-shot methods evaluated in this paper. The images are from selected classes in the Gastrovision [<a href="#B32-make-07-00022" class="html-bibr">32</a>] dataset and do not have segmentation masks available to the authors.</p>
Full article ">Figure 11
<p>We show six images from the ImageNet-1k validation split processed with the method presented in our paper. Each row shows the steps of the process. First, we obtain the CAM and extract a bounding box based on a threshold of our choosing. We then generate a mask from the bounding box, shown in blue. Next, we find the best overlap with the automatically generated masks from the image, shown in a different color for each mask. The best overlapping mask is chosen as our final mask, shown in pink. We use the pre-trained DenseNet-121 model available in PyTorch, which is trained on the ImageNet [<a href="#B37-make-07-00022" class="html-bibr">37</a>] dataset. The figure shows that our method can also be extended to other datasets; however, it struggles when there is more than one instance of the object or when the object is best segmented as unconnected masks.</p>
Full article ">Figure 12
<p>The size of the bounding box can make a big difference for mask generation or selection. In this figure, we show that for this particular image, the belt is chosen as the best overlapping mask. However, a bigger bounding box is able to capture the whole object. Therefore, the bounding box threshold must be adjusted according to the size of the object we try to segment.</p>
Full article ">