Search Results (35)

Search Parameters:
Keywords = video saliency model

17 pages, 9404 KiB  
Article
SimpleTrackV2: Rethinking the Timing Characteristics for Multi-Object Tracking
by Yan Ding, Yuchen Ling, Bozhi Zhang, Jiaxin Li, Lingxi Guo and Zhe Yang
Sensors 2024, 24(18), 6015; https://doi.org/10.3390/s24186015 - 17 Sep 2024
Cited by 1 | Viewed by 1286
Abstract
Multi-object tracking tasks aim to assign unique trajectory codes to targets in video frames. Most detection-based tracking methods use Kalman filtering algorithms for trajectory prediction, directly utilizing associated target features for trajectory updates. However, this approach often fails under camera jitter and transient target loss in real-world scenarios. This paper rethinks state prediction and fusion based on target temporal features to address these issues and proposes the SimpleTrackV2 algorithm, building on the previously designed SimpleTrack. Firstly, to address the poor prediction performance of linear motion models in complex scenes, we designed a target state prediction algorithm called LSTM-MP, based on long short-term memory (LSTM). This algorithm encodes the target’s historical motion information with an LSTM and decodes it with a multilayer perceptron (MLP) to predict the target state. Secondly, to mitigate the effect of occlusion on target state saliency, we designed a spatiotemporal attention-based target appearance feature fusion (TSA-FF) algorithm for target state fusion. TSA-FF calculates adaptive fusion coefficients to enhance target state fusion, thereby improving the accuracy of subsequent data association. To demonstrate the effectiveness of the proposed method, we compared SimpleTrackV2 with the baseline model SimpleTrack on the MOT17 dataset. We also conducted ablation experiments on TSA-FF and LSTM-MP within SimpleTrackV2, exploring the optimal number of fusion frames and the impact of different loss functions on model performance. The experimental results show that SimpleTrackV2 handles camera jitter and target occlusion better, achieving improvements of 1.6%, 3.2%, and 6.1% in MOTA, IDF1, and HOTA, respectively, compared to the SimpleTrack algorithm.
(This article belongs to the Section Sensing and Imaging)
Show Figures

Figure 1. Motion state differences and linear Kalman filter predictions for the pedestrian target with video sequence ID 23 in MOT17-09. The linear Kalman filter predicts the center point coordinates and the width and height of the target’s bounding box based on a linear motion model. The black curve represents the frame-to-frame differences in the Kalman filter’s predictions; the blue and green curves show the differences in the horizontal and vertical coordinates of the target’s center point, respectively; the purple and orange curves illustrate the differences in the target’s width and height; and the red and cyan curves depict the intersection over union (IoU) of the predicted bounding boxes and the degree of target occlusion.
Figure 2. The target state prediction and fusion pipeline in SimpleTrackV2. SimpleTrackV2 inherits the feature decoupling and association components from SimpleTrack, with the improvements highlighted in the red boxes, which primarily include the design of LSTM-MP for target state prediction and TSA-FF for target state fusion. In the diagram, the blue output lines from feature decoupling represent the extracted positional features, while the purple arrows indicate the extracted appearance features. The LSTM-MP model utilizes the differences in the target’s motion state between consecutive frames as input to predict future motion states. The TSA-FF algorithm computes temporal and spatial attention weights to determine fusion weight coefficients, facilitating the effective fusion of target states.
Figure 3. The LSTM-MP model begins by applying a feature transformation that converts the motion features of the target’s historical frames into the differences between consecutive frame motion states, which are then used as input to the model. The LSTM-MP model encodes these features and reduces their dimensionality through a multilayer perceptron (MLP) network. Finally, an inverse feature transformation is applied to predict the target’s future motion state.
Figure 4. The overall structure of the LSTM-MP model.
Figure 5. The TSA-FF algorithm first computes the temporal and spatial attention weights for the appearance feature vectors of different time frames for the same target. It then sums the self-spatial attention coefficients of each historical frame with the interaction spatial attention coefficients and multiplies this sum by the temporal attention weights to determine the final fusion coefficients. Finally, the appearance features of each historical frame are weighted and averaged using these fusion coefficients, followed by normalization to obtain the fused target appearance features.
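As described in the abstract and in Figures 3 and 4, LSTM-MP encodes a short history of frame-to-frame motion-state differences with an LSTM and decodes the next difference with an MLP. The sketch below is a hypothetical PyTorch reconstruction of that idea only; the layer sizes, history length, and state layout (center x/y, width, height) are assumptions, not the authors’ published configuration.

```python
# Minimal sketch of an LSTM + MLP motion-state predictor (assumed state layout: [cx, cy, w, h]).
import torch
import torch.nn as nn

class LSTMMotionPredictor(nn.Module):
    def __init__(self, state_dim=4, hidden_dim=64):
        super().__init__()
        # Encode the sequence of frame-to-frame state differences.
        self.encoder = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        # Decode the last hidden state into the predicted next difference.
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, states):                  # states: (B, T, 4) absolute box states
        diffs = states[:, 1:] - states[:, :-1]  # feature transform: consecutive differences
        _, (h_n, _) = self.encoder(diffs)
        next_diff = self.decoder(h_n[-1])       # predicted next-frame difference
        return states[:, -1] + next_diff        # inverse transform: add back to last state

# Toy usage: predict the next box state from 8 observed frames.
track = torch.randn(2, 8, 4)
print(LSTMMotionPredictor()(track).shape)       # torch.Size([2, 4])
```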
19 pages, 9912 KiB  
Article
Research on VVC Intra-Frame Bit Allocation Scheme Based on Significance Detection
by Xuesong Jin, Huiyuan Sun and Yuhang Zhang
Appl. Sci. 2024, 14(1), 471; https://doi.org/10.3390/app14010471 - 4 Jan 2024
Cited by 1 | Viewed by 2999
Abstract
This research addresses the intra-frame rate control algorithm of the Versatile Video Coding (VVC) standard, where the bit allocation process tends to over-allocate bitrate to the last coding tree units (CTUs) in a frame while the earlier CTUs are not effectively compressed. Fusing a Canny-based edge detection algorithm, a color-contrast-based saliency detection algorithm, a Sum of Absolute Transformed Differences (SATD)-based CTU coding complexity measure, and a Partial Least Squares (PLS) regression model, this paper proposes a CTU-level bit allocation improvement scheme for intra-mode rate control in the VVC standard. First, natural images are selected to produce a lightweight dataset. Second, different metrics are used to obtain the significance and complexity values of each coding unit; the relatively important coding units in the whole frame are selected and adjusted with different weights, and the optimal adjustment multiplicity is added to the dataset. Finally, the PLS regression model is used to obtain regression equations that refine the weights for adjusting the bit allocation. The proposed bit allocation scheme improves the average rate control accuracy by 0.453%, Y-PSNR by 0.05 dB, BD-rate savings by 0.33%, and BD-PSNR by 0.03 dB compared to the VVC standard rate control algorithm.
Show Figures

Figure 1. Development process of video coding standards in various series.
Figure 2. The bpp of all CTUs in the first frame in intra-frame mode under the default rate control algorithm. (a) BQMall; (b) Cactus; (c) BQSquare; (d) FourPeople.
Figure 3. The overall flow of the proposed algorithm.
Figure 4. Significance information extraction, using mv5 as an example. (a) Original image; (b) grayscale image; (c) significance map based on the Canny algorithm; (d) significance map based on the color contrast algorithm.
Figure 5. Feature information of mv5 as an example. (a) Coding complexity based on SATD; (b) significance value based on the Canny algorithm; (c) significance value based on the color contrast algorithm.
Figure 6. Relationship between multiplicity and rate control efficiency for the example of mv5.
Figure 7. Regression model analysis for the improved CTUs.
Figure 8. Bit tuning for a test sequence with a set maximum bit cost. (a) PeopleOnStreet; (b) BasketballDrive; (c) BQMall; (d) RaceHorses; (e) FourPeople; (f) BasketballDrillText.
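The scheme combines per-CTU significance (Canny edges, color contrast) with an SATD complexity measure to re-weight the CTU-level bit budget. The snippet below is a rough illustration of that idea using only a Canny edge density and block variance as stand-ins; the 128×128 CTU size, the weighting formula, and the normalization are assumptions for illustration, not the paper’s calibrated PLS-based model.

```python
# Rough sketch: re-weight a per-CTU bit budget by edge density and block variance.
import cv2
import numpy as np

def ctu_bit_weights(frame_gray, ctu=128):
    h, w = frame_gray.shape
    edges = cv2.Canny(frame_gray, 100, 200)
    scores = []
    for y in range(0, h - ctu + 1, ctu):
        for x in range(0, w - ctu + 1, ctu):
            block = frame_gray[y:y + ctu, x:x + ctu].astype(np.float64)
            edge_density = edges[y:y + ctu, x:x + ctu].mean() / 255.0   # crude "significance"
            complexity = block.var() / 255.0 ** 2                        # stand-in for SATD
            scores.append(1.0 + 0.5 * edge_density + 0.5 * complexity)   # assumed weighting
    scores = np.array(scores)
    return scores / scores.sum()              # fraction of the frame budget per CTU

frame = np.random.randint(0, 256, (256, 384), dtype=np.uint8)
weights = ctu_bit_weights(frame)
print(weights.round(3), weights.sum())        # per-CTU shares summing to 1
```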
21 pages, 4424 KiB  
Article
A Salient Object Detection Method Based on Boundary Enhancement
by Falin Wen, Qinghui Wang, Ruirui Zou, Ying Wang, Fenglin Liu, Yang Chen, Linghao Yu, Shaoyi Du and Chengzhi Yuan
Sensors 2023, 23(16), 7077; https://doi.org/10.3390/s23167077 - 10 Aug 2023
Viewed by 1556
Abstract
Visual saliency refers to the human ability to quickly focus on important parts of the visual field, which is a crucial aspect of image processing, particularly in fields like medical imaging and robotics. Understanding and simulating this mechanism is essential for solving complex visual problems. In this paper, we propose a salient object detection method based on boundary enhancement, which is applicable to both 2D and 3D sensor data. To address the problem of large-scale variation of salient objects, our method introduces a multi-level feature aggregation module that enhances the expressive ability of fixed-resolution features by utilizing adjacent features to complement each other. Additionally, we propose a multi-scale information extraction module to capture local contextual information at different scales for the back-propagated level-by-level features, which allows for a better measurement of the composition of the feature map after back-fusion. To tackle the low confidence of boundary pixels, we also introduce a boundary extraction module to extract the boundary information of salient regions. This information is then fused with the salient target information to further refine the saliency prediction results. During training, our method uses a mixed loss function to constrain the model at two levels: pixels and images. The experimental results demonstrate that our boundary-enhanced salient target detection method performs well on targets of different scales, multiple targets, linear targets, and targets in complex scenes. Compared with the best-performing methods on four conventional datasets, we achieve an average improvement of 6.2% on the mean absolute error (MAE) indicator. Overall, our approach shows promise for improving the accuracy and efficiency of salient object detection in a variety of settings, including those involving 2D/3D semantic analysis and the reconstruction/inpainting of image/video/point cloud data.
(This article belongs to the Special Issue Machine Learning Based 2D/3D Sensors Data Understanding and Analysis)
Show Figures

Figure 1. The overall framework of the salient target detection method based on boundary enhancement.
Figure 2. Detailed network structure diagram, taking ResNet-50 as an example.
Figure 3. Structure diagram of the multi-level feature aggregation module.
Figure 4. Multi-scale information extraction module.
Figure 5. Boundary extraction module.
Figure 6. A visualization of the ablation experiment. From left to right: the original image, the ground-truth image, the baseline result, and the result of this study.
Figure 7. PR curves and F-measure curves.
Figure 8. Visual comparison for the comparative experiment. The left two columns display the original image and the ground-truth image, while the remaining columns show the results of the proposed method and other methods.
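The abstract mentions a mixed loss that constrains training at both the pixel level and the image level. A common way to realize this, shown below as an assumed illustration rather than the authors’ exact formulation, is to combine per-pixel binary cross-entropy with an image-level soft-IoU term.

```python
# Sketch of a mixed pixel + image level saliency loss (BCE + soft IoU); assumed formulation.
import torch
import torch.nn.functional as F

def mixed_saliency_loss(logits, target, eps=1e-6):
    # Pixel level: binary cross-entropy on every pixel.
    bce = F.binary_cross_entropy_with_logits(logits, target)
    # Image level: soft IoU between the predicted map and the ground truth.
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = (prob + target - prob * target).sum(dim=(1, 2, 3))
    iou = 1.0 - (inter + eps) / (union + eps)
    return bce + iou.mean()

pred = torch.randn(2, 1, 64, 64)                 # raw saliency logits
gt = (torch.rand(2, 1, 64, 64) > 0.5).float()    # binary ground-truth masks
print(mixed_saliency_loss(pred, gt).item())
```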
22 pages, 2661 KiB  
Article
Human Action Representation Learning Using an Attention-Driven Residual 3DCNN Network
by Hayat Ullah and Arslan Munir
Algorithms 2023, 16(8), 369; https://doi.org/10.3390/a16080369 - 31 Jul 2023
Cited by 2 | Viewed by 1674
Abstract
The recognition of human activities using vision-based techniques has become a crucial research field in video analytics. Over the last decade, there have been numerous advancements in deep learning algorithms aimed at accurately detecting complex human actions in video streams. While these algorithms have demonstrated impressive performance in activity recognition, they often favor either model performance or computational efficiency. This biased trade-off between robustness and efficiency poses challenges when addressing complex human activity recognition problems. To address this issue, this paper presents a computationally efficient yet robust approach that exploits saliency-aware spatial and temporal features for human action recognition in videos. To achieve an effective representation of human actions, we propose the dual-attentional Residual 3D Convolutional Neural Network (DA-R3DCNN). Our method utilizes a unified channel-spatial attention mechanism, allowing it to efficiently extract significant human-centric features from video frames. By combining dual channel-spatial attention layers with residual 3D convolution layers, the network becomes more discerning in capturing spatial receptive fields containing objects within the feature maps. To assess the effectiveness and robustness of the proposed method, we conducted extensive experiments on four well-established benchmark datasets for human action recognition. The quantitative results validate the efficiency of our method, showing accuracy improvements of up to 11% compared to state-of-the-art human action recognition methods. Additionally, our evaluation of inference time reveals that the proposed method achieves up to a 74× improvement in frames per second (FPS) compared to existing approaches, demonstrating the suitability of the proposed DA-R3DCNN for real-time human activity recognition.
(This article belongs to the Special Issue Algorithms for Image Processing and Machine Vision)
Show Figures

Figure 1. The graphical abstract of our proposed DA-R3DCNN network architecture.
Figure 2. A visual overview of the 3D residual convolutional block used in this study, with a convolutional shortcut path.
Figure 3. A visual overview of the dual channel-spatial attention module.
Figure 4. Architecture of the dual channel-spatial attention module.
Figure 5. Confusion matrices computed for the proposed DA-R3DCNN framework on the test sets of the four experimental datasets: (a) UCF11, (b) UCF50, (c) HMDB51, and (d) UCF101.
Figure 6. A graphical overview of the comparative analysis of our proposed DA-R3DCNN against state-of-the-art methods on the (a) UCF11, (b) UCF50, (c) HMDB51, and (d) UCF101 datasets.
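DA-R3DCNN combines residual 3D convolutions with a unified channel-spatial attention mechanism. The block below sketches one plausible form of such attention for 5-D video feature maps (batch, channels, time, height, width); the squeeze ratio and the exact attention layout are assumptions, not the published architecture.

```python
# Sketch of a channel-spatial attention block for 3D (video) feature maps; layout assumed.
import torch
import torch.nn as nn

class ChannelSpatialAttention3D(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        # Channel attention: squeeze over (T, H, W), excite per channel.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )
        # Spatial attention: a 1-channel mask over (T, H, W).
        self.spatial = nn.Sequential(nn.Conv3d(channels, 1, kernel_size=3, padding=1),
                                     nn.Sigmoid())

    def forward(self, x):                          # x: (B, C, T, H, W)
        c = self.channel_mlp(x).view(x.size(0), -1, 1, 1, 1)
        x = x * c                                  # re-weight channels
        return x * self.spatial(x)                 # re-weight spatiotemporal positions

feat = torch.randn(2, 32, 8, 28, 28)
print(ChannelSpatialAttention3D(32)(feat).shape)   # torch.Size([2, 32, 8, 28, 28])
```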
20 pages, 2466 KiB  
Article
SDebrisNet: A Spatial–Temporal Saliency Network for Space Debris Detection
by Jiang Tao, Yunfeng Cao and Meng Ding
Appl. Sci. 2023, 13(8), 4955; https://doi.org/10.3390/app13084955 - 14 Apr 2023
Cited by 10 | Viewed by 3327
Abstract
The rapidly growing number of space activities is generating a large amount of space debris, which greatly threatens the safety of space operations. Therefore, space-based space debris surveillance is crucial for the early avoidance of spacecraft emergencies. With progress in computer vision technology, space debris detection using optical sensors has become a promising solution. However, detecting space debris at far ranges is challenging due to its limited imaging size and unknown movement characteristics. In this paper, we propose a space debris saliency detection algorithm called SDebrisNet. The algorithm utilizes a convolutional neural network (CNN) to take into account both spatial and temporal data from sequential video images, aiming to assist in detecting small and moving space debris. Firstly, taking into account the limited resources of space-based computational platforms, a MobileNet-based space debris feature extraction structure was constructed to make the overall model more lightweight. In particular, an enhanced spatial feature module is introduced to strengthen the spatial details of small objects. Secondly, based on attention mechanisms, a constrained self-attention (CSA) module is applied to learn the spatiotemporal data from the sequential images. Finally, a space debris dataset was constructed for algorithm evaluation. The experimental results demonstrate that the proposed method is robust for detecting moving space debris with a low signal-to-noise ratio in video. Compared to the NODAMI method, SDebrisNet shows improvements of 3.5% and 1.7% in detection probability and false alarm rate, respectively.
(This article belongs to the Special Issue Vision-Based Autonomous Unmanned Systems: Challenges and Approaches)
Show Figures

Figure 1. Flow chart of the proposed space debris detection method.
Figure 2. Detailed illustration of the proposed saliency detection network, which consists of the spatial feature extraction module, spatial feature enhancement module, temporal feature extraction module, and saliency prediction module.
Figure 3. Spatial feature extraction module. The black squares on each line denote the first five output feature maps of each block. The number at the end of each line is the number of extracted feature maps.
Figure 4. Spatial feature enhancement module.
Figure 5. Illustration of the constrained self-attention (CSA) network. The video clip includes four frames as an example in this figure.
Figure 6. The spatial positions of space debris at different motion speeds. The space debris crosses three consecutive frames at two different speeds, where v2 > v1.
Figure 7. Example of a video sequence from the SDD dataset with SNR = 0.2, diameter = 0.1 m, speed = 4 pixels/frame, and direction = 10° (top row). Close-up of the space debris marked by a green circle in the image sequence (bottom row).
Figure 8. Saliency detection results of SDebrisNet without different components; all frames are collapsed by a max operator.
Figure 9. Two real video datasets; all frames are collapsed by a max operator. Video 1 includes two pieces of space debris with linear tracks. Video 2 includes one space object with a curved track.
Figure 10. Example of the detected centroids on real video sequence 1 (top row) and video sequence 2 (bottom row). The detected centroid coordinates and missed detections are marked with green circles and red circles, respectively.
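SDebrisNet keeps the spatial backbone lightweight by building it from MobileNet-style layers. The snippet below sketches the standard depthwise-separable convolution that such backbones rely on (a depthwise 3×3 followed by a pointwise 1×1); the channel counts and normalization choices are illustrative assumptions, not SDebrisNet’s exact configuration.

```python
# Sketch of a MobileNet-style depthwise-separable convolution block (illustrative sizes).
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Depthwise 3x3: one filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                                   groups=in_ch, bias=False)
        # Pointwise 1x1: mixes channels cheaply.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))

frame_feat = torch.randn(2, 16, 128, 128)
print(DepthwiseSeparableConv(16, 32, stride=2)(frame_feat).shape)  # (2, 32, 64, 64)
```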
17 pages, 5104 KiB  
Article
Video Saliency Object Detection with Motion Quality Compensation
by Hengsen Wang, Chenglizhao Chen, Linfeng Li and Chong Peng
Electronics 2023, 12(7), 1618; https://doi.org/10.3390/electronics12071618 - 30 Mar 2023
Cited by 1 | Viewed by 1721
Abstract
Video saliency object detection is one of the classic research problems in computer vision, yet existing works rarely focus on the impact of input quality on model performance. As optical flow is a key input for video saliency detection models, its quality significantly affects model performance. Traditional optical flow models only calculate the optical flow between two consecutive video frames, ignoring the motion state of objects over a period of time, which leads to low-quality optical flow and reduced performance of video saliency object detection models. Therefore, this paper proposes a new optical flow model that improves optical flow quality by expanding the flow perception range, and uses the resulting high-quality optical flow to enhance the performance of video saliency object detection models. Experimental results show that the proposed optical flow model can significantly improve optical flow quality, with the S-M values on the DAVSOD dataset increasing by about 39%, 49%, and 44% compared to optical flow models such as PWCNet, SpyNet, and LFNet. In addition, experiments fine-tuning the benchmark model LIMS demonstrate that improving input quality can further improve model performance.
Show Figures

Figure 1. (a) Traditional optical flow models only use two frames for optical flow calculation. (b) Optical flow maps generated by traditional optical flow models.
Figure 2. (a) The new optical flow model proposed in this paper. (b) High-quality optical flow maps and RGB images are fused by a traditional feature fusion module to obtain a saliency map.
Figure 3. Color saliency maps are obtained from RGB images through the color saliency module, while motion saliency maps are obtained from optical flow maps through the motion saliency module.
Figure 4. Comparison of optical flow maps generated by different optical flow models in multiple scenes; “Ours” represents the optical flow map generated by the optical flow model proposed in this paper.
Figure 5. High-quality motion saliency maps and color saliency maps often have higher structural consistency.
Figure 6. Comparison of motion saliency maps generated by four different optical flow models, with “Ours” representing the motion saliency map generated by the optical flow model proposed in this paper.
Figure 7. Comparison of saliency maps generated by different VSOD models; “LIMS+Our” refers to the saliency map generated by fine-tuning the LIMS model using the optical flow model proposed in this paper.
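The key idea is to widen the flow perception range beyond a single frame pair so that short-term motion history stabilizes the flow. The paper’s model is learned; the snippet below only illustrates the underlying intuition with classical Farnebäck flow averaged over a small window of frame pairs, an assumed stand-in rather than the proposed network.

```python
# Illustration only: average classical Farneback flow over several frame pairs
# to mimic a wider flow perception range (not the paper's learned model).
import cv2
import numpy as np

def windowed_flow(gray_frames, window=4):
    """Average dense optical flow from the previous `window` frames to the current frame."""
    ref = gray_frames[-1]
    flows = []
    for prev in gray_frames[-1 - window:-1]:
        flow = cv2.calcOpticalFlowFarneback(prev, ref, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
    return np.mean(flows, axis=0)             # (H, W, 2) smoothed motion field

frames = [np.random.randint(0, 256, (120, 160), dtype=np.uint8) for _ in range(6)]
print(windowed_flow(frames).shape)            # (120, 160, 2)
```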
16 pages, 8732 KiB  
Article
Just Noticeable Difference Model for Images with Color Sensitivity
by Zhao Zhang, Xiwu Shang, Guoping Li and Guozhong Wang
Sensors 2023, 23(5), 2634; https://doi.org/10.3390/s23052634 - 27 Feb 2023
Cited by 1 | Viewed by 3433
Abstract
The just noticeable difference (JND) model reflects the visibility limitations of the human visual system (HVS), plays an important role in perceptual image/video processing, and is commonly applied to perceptual redundancy removal. However, existing JND models are usually constructed by treating the color components of the three channels equally, and their estimation of the masking effect is inadequate. In this paper, we introduce visual saliency and color sensitivity modulation to improve the JND model. Firstly, we comprehensively combine contrast masking, pattern masking, and edge protection to estimate the masking effect. Then, the visual saliency of the HVS is taken into account to adaptively modulate the masking effect. Finally, we build color sensitivity modulation according to the perceptual sensitivities of the HVS to adjust the sub-JND thresholds of the Y, Cb, and Cr components. Thus, the color-sensitivity-based JND model (CSJND) is constructed. Extensive experiments and subjective tests were conducted to verify the effectiveness of the CSJND model. The consistency between the CSJND model and the HVS was better than that of existing state-of-the-art JND models.
(This article belongs to the Special Issue Image/Signal Processing and Machine Vision in Sensing Applications)
Show Figures

Figure 1. The framework of the proposed CSJND model.
Figure 2. An example of JND generation and a contaminated image guided by JND noise: (a) the original image; (b) response map for contrast masking of the Y component; (c) response map for pattern masking of the Y component; (d) saliency prediction map; (e) JND map of the Y component; and (f) JND-contaminated image, with PSNR = 27.00 dB.
Figure 3. The Prewitt kernels in the vertical and horizontal directions.
Figure 4. Comparison of contaminated images from JND models based on the different proposed factors. The contaminated images have the same level of noise, with PSNR = 28.25 dB. (a) The original image. (b) The basic model JND_θ^B, VMAF = 80.10. (c) The model JND_θ^S based on the basic model and saliency modulation, VMAF = 84.42. (d) The model JND_θ^C based on the basic model and color sensitivity modulation, VMAF = 88.04. (e) The proposed model CSJND_θ, VMAF = 94.75.
Figure 5. An example comparison of contaminated images from different JND models. The contaminated images have the same level of noise, with PSNR = 28.91 dB. (a) The original image; (b) Wu2013, VMAF = 82.65; (c) Wu2017, VMAF = 83.41; (d) Chen2019, VMAF = 87.44; (e) Jiang2022, VMAF = 90.34; and (f) the proposed CSJND model, VMAF = 94.99.
Figure 6. The set of test images, in order from I1 to I12.
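A JND map is typically validated by injecting noise whose per-pixel magnitude equals the JND threshold and then checking how image quality (e.g., PSNR or VMAF) degrades. The snippet below shows that evaluation step with a toy luminance-adaptation threshold; the formula is a simplified textbook-style assumption, not the CSJND model.

```python
# Toy JND-guided noise injection: a simplified luminance-adaptation threshold
# (not the CSJND model) and random-sign noise bounded by it.
import numpy as np

def luminance_jnd(gray):
    bg = gray.astype(np.float64)
    low = 17.0 * (1.0 - np.sqrt(bg / 127.0)) + 3.0    # darker regions tolerate more noise
    high = 3.0 / 128.0 * (bg - 127.0) + 3.0
    return np.where(bg <= 127, low, high)

def inject_jnd_noise(gray):
    jnd = luminance_jnd(gray)
    signs = np.random.choice([-1.0, 1.0], size=gray.shape)
    return np.clip(gray + signs * jnd, 0, 255).astype(np.uint8)

def psnr(a, b):
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)

img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
noisy = inject_jnd_noise(img)
print(round(psnr(img, noisy), 2))   # quality after injecting JND-bounded noise
```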
20 pages, 4998 KiB  
Article
Quality-Driven Dual-Branch Feature Integration Network for Video Salient Object Detection
by Xiaofei Zhou, Hanxiao Gao, Longxuan Yu, Defu Yang and Jiyong Zhang
Electronics 2023, 12(3), 680; https://doi.org/10.3390/electronics12030680 - 29 Jan 2023
Cited by 4 | Viewed by 1673
Abstract
Video salient object detection has attracted growing interest in recent years. However, some existing video saliency models suffer from the inappropriate utilization of spatial and temporal cues and the insufficient aggregation of different-level features, leading to remarkable performance degradation. Therefore, we propose a quality-driven dual-branch feature integration network centered on the adaptive fusion of multi-modal cues and the sufficient aggregation of multi-level spatiotemporal features. Firstly, we employ the quality-driven multi-modal feature fusion (QMFF) module to combine the spatial and temporal features. In particular, the quality scores estimated from each level’s spatial and temporal cues are not only used to weigh the two modal features but also to adaptively integrate the coarse spatial and temporal saliency predictions into a guidance map, which further enhances the two modal features. Secondly, we deploy the dual-branch-based multi-level feature aggregation (DMFA) module to integrate the multi-level spatiotemporal features, where the two branches, namely the progressive decoder branch and the direct concatenation branch, sufficiently explore the cooperation of multi-level spatiotemporal features. In particular, to provide an adaptive fusion of the outputs of the two branches, we design the dual-branch fusion (DF) unit, where the channel weight of each output is learned jointly from the two outputs. The experiments conducted on four video datasets clearly demonstrate the effectiveness and superiority of our model against state-of-the-art video saliency models.
Show Figures

Figure 1. The architecture of the proposed video saliency model: the inputs are the current frame image I and its optical flow image M. First, the encoder network generates the appearance features {F_i^A} (i = 1, …, 5) and the motion features {F_i^M} (i = 1, …, 5). Then, the two modal features are combined by the quality-driven multi-modal feature fusion (QMFF) module, yielding the multi-level spatiotemporal deep features {F_i^ST} (i = 1, …, 5). Next, the inter-level feature interaction (IFI) unit is applied to each level’s spatiotemporal feature to obtain the enhanced spatiotemporal features {F_i} (i = 1, …, 5). After that, the dual-branch-based multi-level feature aggregation (DMFA) module integrates the spatiotemporal features, yielding the deep decoding features {F_i^D} (i = 1, …, 5) and the concatenation feature F_C*. Finally, the dual-branch fusion (DF) unit integrates F_C* and F_1^D, generating the final high-quality saliency map S. Here, l_3 is the supervision.
Figure 2. Detailed structure of the quality-driven multi-modal feature fusion (QMFF) module. Here, P_i^A and P_i^M refer to the coarse saliency predictions, Q_i^A and Q_i^M denote the quality scores, and P_i is the guidance map. Moreover, F_i^A and F_i^M are the i-th level deep features, F_i^Aq and F_i^Mq refer to the weighted features, F_i^Ar and F_i^Mr are the refined features, and F_i^ST is the spatiotemporal deep feature.
Figure 3. Illustration of the dual-branch fusion (DF) unit. Here, F_1^D is the output of the progressive decoder branch, F_C* is the fused feature, FC(·) is the fully connected layer, and w is the feature weight, which is divided into two sub-feature weights w_C and w_D. S is the final saliency map, split_w(·) is the split operation, and l_i (i = 1, 2, 3) is the supervision.
Figure 4. Visualization of the dual-branch fusion (DF) unit. (a) Input frame, (b) ground truth, (c) saliency prediction of l_3, (d) feature map of F_1^D, (e) saliency prediction of l_1, (f) feature map of F_C*, (g) saliency prediction of l_2.
Figure 5. (Better viewed in color) Quantitative comparison results of different video saliency models: (a) PR curves and (b) F-measure curves on the DAVIS dataset; (c) PR curves and (d) F-measure curves on the DAVSOD dataset; (e) PR curves and (f) F-measure curves on the ViSal dataset; (g) PR curves and (h) F-measure curves on the SegV2 dataset.
Figure 6. Qualitative comparison results of different video saliency models on several challenging videos of the DAVIS dataset. (a) Video frames, (b) GT, (c) OUR, (d) CAG-DDE, (e) DCFNet, (f) STFA, (g) GTNet, (h) MAGCN, (i) PCSA, (j) MGA, (k) SSAV, (l) PDB, (m) MBNM, (n) FGRNE, (o) SCOM, (p) SCNN, (q) SFLR, (r) STBP, and (s) SGSP.
Figure 7. Qualitative comparison results of different video saliency models on several challenging videos of the DAVSOD dataset. (a) Video frames, (b) GT, (c) OUR, (d) CAG-DDE, (e) DCFNet, (f) STFA, (g) GTNet, (h) PCSA, (i) MGA, (j) SSAV, (k) MBNM, (l) FGRNE, (m) SCOM, (n) SCNN, and (o) STBP.
Figure 8. Qualitative comparisons of four variants of our model. (a) Input frames, (b) GT, (c) w/o QMFF-qf, (d) w/o QMFF-f, (e) w/o QMFF-qp, (f) QMFF-q, and (g) OUR.
Figure 9. Qualitative comparisons of a variant of our model. (a) Input frames, (b) GT, (c) w/o IFI, and (d) OUR.
Figure 10. Qualitative comparisons of four variants of our model. (a) Input frames, (b) GT, (c) w/o db1, (d) w/o db2, (e) w/o DF, (f) w BiFPN, and (g) OUR.
Figure 11. Qualitative comparisons of a variant of our model. (a) Input frames, (b) GT, (c) w lw, and (d) OUR.
Figure 12. Some failure examples of our model. (a) Input frames, (b) ground truth, and (c) saliency maps generated by our model.
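QMFF weighs the appearance and motion streams by per-level quality scores estimated from their coarse saliency predictions. The block below is a simplified, assumed realization of that weighting for a single level: each stream predicts a coarse map, a scalar quality score gates its features, and the gated features are summed. The real module also performs refinement steps not shown here.

```python
# Simplified sketch of quality-driven fusion of appearance and motion features (one level).
import torch
import torch.nn as nn

class QualityDrivenFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.pred_a = nn.Conv2d(channels, 1, 1)    # coarse saliency from appearance stream
        self.pred_m = nn.Conv2d(channels, 1, 1)    # coarse saliency from motion stream
        self.quality = nn.Sequential(              # scalar quality score per stream
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(1, 1), nn.Sigmoid())

    def forward(self, feat_a, feat_m):
        p_a, p_m = self.pred_a(feat_a), self.pred_m(feat_m)
        q_a = self.quality(p_a).view(-1, 1, 1, 1)            # appearance quality score
        q_m = self.quality(p_m).view(-1, 1, 1, 1)            # motion quality score
        fused = q_a * feat_a + q_m * feat_m                   # quality-weighted spatiotemporal feature
        guidance = torch.sigmoid(q_a * p_a + q_m * p_m)       # coarse guidance map
        return fused, guidance

fa, fm = torch.randn(2, 64, 56, 56), torch.randn(2, 64, 56, 56)
fused, guide = QualityDrivenFusion(64)(fa, fm)
print(fused.shape, guide.shape)
```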
18 pages, 3040 KiB  
Article
OTNet: A Small Object Detection Algorithm for Video Inspired by Avian Visual System
by Pingge Hu, Xingtong Wang, Xiaoteng Zhang, Yueyang Cang and Li Shi
Mathematics 2022, 10(21), 4125; https://doi.org/10.3390/math10214125 - 4 Nov 2022
Cited by 2 | Viewed by 2510
Abstract
Small object detection is one of the most challenging and non-negligible fields in computer vision. Inspired by the location–focus–identification process of the avian visual system, we present a location-focused small-object-detection algorithm for video or image sequences, OTNet. The model contains three modules corresponding to the forms of saliency that drive the strongest responses of the OT, which together compute the saliency map. The three modules are responsible for temporal–spatial feature extraction, spatial feature extraction, and memory matching, respectively. We tested our model on the AU-AIR dataset and achieved up to a 97.95% recall rate, an 85.73% precision rate, and an 89.94 F1 score with lower computational complexity. Our model can also serve as a plugin module for other object detection models to improve their performance on bird-view images, especially for detecting smaller objects, improving detection performance by up to 40.01%. The results show that our model performs well on common detection metrics while simulating the visual information processing for object localization in the avian brain.
(This article belongs to the Special Issue Mathematical Method and Application of Machine Learning)
Show Figures

Figure 1. The transverse section of the midbrain showing the OT [31].
Figure 2. The structure of our algorithm.
Figure 3. The detailed structures of the models. (a) The overall structure of the model. (b) The difference between OTNet and OTNet-Lite.
Figure 4. The change in precision and recall rate over the training process. (a) Precision rate. (b) Recall rate.
Figure 5. The structure of our algorithm. (a) The ground truth. (b) The KLT result. (c) The LK result.
Figure 6. The localization results of OTNet in different scenes. (a) Multiple objects. (b) Truncation. (c) Bird view. (d) Tiny objects. (e) Occlusion. (f) Low contrast.
Figure 7. The classification results of OTNet-C. The larger boxes and the smaller boxes represent different categories. (a) Result for the parking lot scene. (b) Result for the circular road scene.
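OTNet’s first module extracts temporal–spatial saliency that responds strongly to small moving objects. The snippet below is a deliberately simple classical stand-in for that idea, not OTNet itself: a three-frame difference map thresholded into a motion saliency mask.

```python
# Classical stand-in for temporal motion saliency: three-frame differencing (not OTNet).
import numpy as np

def three_frame_saliency(f_prev, f_curr, f_next, thresh=25):
    d1 = np.abs(f_curr.astype(np.int16) - f_prev.astype(np.int16))
    d2 = np.abs(f_next.astype(np.int16) - f_curr.astype(np.int16))
    motion = np.minimum(d1, d2)                  # suppresses ghosting from either pair alone
    return (motion > thresh).astype(np.uint8)    # binary motion-saliency mask

a, b, c = (np.random.randint(0, 256, (90, 120), dtype=np.uint8) for _ in range(3))
mask = three_frame_saliency(a, b, c)
print(mask.shape, mask.sum())                    # mask size and number of "salient" pixels
```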
24 pages, 3279 KiB  
Article
Saliency-Enabled Coding Unit Partitioning and Quantization Control for Versatile Video Coding
by Wei Li, Xiantao Jiang, Jiayuan Jin, Tian Song and Fei Richard Yu
Information 2022, 13(8), 394; https://doi.org/10.3390/info13080394 - 19 Aug 2022
Cited by 5 | Viewed by 2335
Abstract
The latest video coding standard, versatile video coding (VVC), has greatly improved coding efficiency over its predecessor, high efficiency video coding (HEVC), but at the expense of sharply increased complexity. In the context of perceptual video coding (PVC), visual saliency models that exploit the characteristics of the human visual system to improve coding efficiency have become a reliable approach thanks to advances in computer performance and visual algorithms. In this paper, a novel VVC-compliant PVC optimization framework is proposed, which consists of a fast coding unit (CU) partition algorithm and a quantization control algorithm. Firstly, based on the visual saliency model, we propose a fast CU division scheme, including the re-determination of the CU division depth by calculating the Scharr operator and variance, as well as the decision on whether to apply intra sub-partitions (ISP), to reduce coding complexity. Secondly, a quantization control algorithm is proposed that adjusts the quantization parameter based on a multi-level classification of saliency values at the CU level to reduce the bitrate. In comparison with the reference model, experimental results indicate that the proposed method reduces computational complexity by about 47.19% and achieves an average bitrate saving of 3.68%. Meanwhile, the proposed algorithm incurs reasonable peak signal-to-noise ratio losses and nearly the same subjective perceptual quality.
(This article belongs to the Special Issue Signal Processing Based on Convolutional Neural Network)
Show Figures

Figure 1. Five division modes of QTMT and two examples of ISP.
Figure 2. Traditional video coding framework and the proposed perceptual video coding framework.
Figure 3. Illustration of the network for static saliency detection.
Figure 4. Illustration of our network for dynamic saliency detection.
Figure 5. VS maps of the Basketball and Kimono sequences.
Figure 6. Flowchart of the fast CU partitioning algorithm.
Figure 7. Training results of the two thresholds concerning both BDBR and time saving.
Figure 8. Mapping relationship between the saliency range and the complexity grade.
Figure 9. RD performance of the proposed method. (a) RD of “PartyScene”. (b) RD of “FourPeople”.
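The fast CU partition step re-decides the split depth from simple texture measures, namely a Scharr gradient magnitude and the block variance, compared against trained thresholds. The snippet below sketches that decision for one CU; the thresholds and the decision rule are placeholders, not the trained values from the paper.

```python
# Sketch of a texture-based CU split decision using Scharr gradients and variance.
# Thresholds are illustrative placeholders, not the paper's trained values.
import cv2
import numpy as np

def should_split_cu(cu_block, grad_thresh=40.0, var_thresh=300.0):
    gx = cv2.Scharr(cu_block, cv2.CV_64F, 1, 0)
    gy = cv2.Scharr(cu_block, cv2.CV_64F, 0, 1)
    grad_mag = float(np.mean(np.sqrt(gx ** 2 + gy ** 2)))   # texture/edge strength
    variance = float(cu_block.astype(np.float64).var())
    # Homogeneous, low-gradient CUs stay unsplit; complex ones are split further.
    return grad_mag > grad_thresh or variance > var_thresh

cu = np.random.randint(0, 256, (32, 32), dtype=np.uint8)
print(should_split_cu(cu))
```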
30 pages, 12327 KiB  
Article
ShadowDeNet: A Moving Target Shadow Detection Network for Video SAR
by Jinyu Bao, Xiaoling Zhang, Tianwen Zhang and Xiaowo Xu
Remote Sens. 2022, 14(2), 320; https://doi.org/10.3390/rs14020320 - 11 Jan 2022
Cited by 14 | Viewed by 3307
Abstract
Most existing SAR moving target shadow detectors not only tend to generate missed detections because of their limited feature extraction capacity in complex scenes, but also tend to produce numerous false alarms due to their poor foreground–background discrimination capacity. To solve these problems, this paper proposes a novel deep learning network called “ShadowDeNet” for better shadow detection of moving ground targets in video synthetic aperture radar (SAR) images. It relies on five major tools to guarantee its superior detection performance: (1) histogram equalization shadow enhancement (HESE) for enhancing shadow saliency to facilitate feature extraction, (2) a transformer self-attention mechanism (TSAM) for focusing on regions of interest to suppress clutter interference, (3) shape deformation adaptive learning (SDAL) for learning the deformed shadows of moving targets to cope with motion speed variations, (4) semantic-guided anchor-adaptive learning (SGAAL) for generating optimized anchors to match shadow location and shape, and (5) online hard-example mining (OHEM) for selecting typical difficult negative samples to improve background discrimination capacity. We conduct extensive ablation studies to confirm the effectiveness of each of the above contributions. We perform experiments on the public Sandia National Laboratories (SNL) video SAR data. Experimental results reveal the state-of-the-art performance of ShadowDeNet, with a best f1 accuracy of 66.01%, in contrast to the other five competitive methods. Specifically, ShadowDeNet is superior to the experimental baseline Faster R-CNN by 9.00% in f1 accuracy, and superior to the existing first-best model by 4.96% in f1 accuracy. Furthermore, ShadowDeNet merely sacrifices a slight detection speed within an acceptable range.
(This article belongs to the Special Issue Artificial Intelligence-Based Learning Approaches for Remote Sensing)
Show Figures

Figure 1. Relative positions between the targets and the corresponding shadows. This video SAR image is the 731st frame in the SNL data.
Figure 2. Shadow detection framework of ShadowDeNet. HESE denotes histogram equalization shadow enhancement. TSAM denotes the transformer self-attention mechanism. SDAL denotes shape deformation adaptive learning. SGAAL denotes semantic-guided anchor-adaptive learning. OHEM denotes online hard-example mining. In ShadowDeNet, without loss of generality, we select the commonly used ResNet-50 [50] as the backbone network.
Figure 3. A video SAR image. (a) The raw video SAR image; (b) the corresponding shadow ground truths. Here, different vehicles are marked in boxes with different colors and numbers for intuitive visual observation. This video SAR image is the 50th frame in the SNL data.
Figure 4. Image pixel histogram before and after HESE.
Figure 5. Moving target shadow before and after HESE. (a) Before HESE; (b) after HESE. The raw video SAR image is shown in Figure 3a.
Figure 6. More results of the histogram equalization shadow enhancement (HESE). (a) Before HESE; (b) after HESE. Different vehicles are marked in boxes with different colors and numbers for intuitive visual observation. #N denotes the N-th frame. The white arrows indicate the moving direction.
Figure 7. Residual block in the backbone network. (a) The raw residual block in ResNet-50; (b) the improved residual block with TSAM.
Figure 8. Detailed implementation process of TSAM.
Figure 9. Moving target shadow deformation with the change of moving speed. From left to right (#64 → #74 → #84 → #94), the speed becomes smaller and smaller. The blue arrow indicates the moving direction.
Figure 10. Different convolutions. (a) Classical convolution; (b) deformation convolution.
Figure 11. Detailed implementation process of SDAL.
Figure 12. Sketch map of different anchor distributions. (a) The raw distribution; (b) the improved distribution with SGAAL. Anchors are marked in blue boxes.
Figure 13. Detailed implementation process of SGAAL.
Figure 14. Detailed implementation process of OHEM.
Figure 15. Experimental working environment of the SNL video SAR data at the Kirtland Air Force Base Eubank Gate. (a) The optical image; (b) the corresponding SAR image.
Figure 16. Accuracy curves with different IOU thresholds for different methods. (a) The curve between recall (r) and IOU; (b) the curve between precision (p) and IOU; (c) the curve between average precision (ap) and IOU; (d) the curve between f1 and IOU.
Figure 17. Precision–recall (p–r) curves of different methods.
Figure 18. Qualitative video SAR moving target shadow detection results of different methods. (a) Ground truth; (b) Faster R-CNN; (c) FPN; (d) YOLOv3; (e) RetinaNet; (f) CenterNet; (g) ShadowDeNet. The false alarms are marked by orange boxes. The missed detections are marked by red ellipses. Apart from CenterNet, the numbers above the boxes are the confidences. The numbers above the boxes in (f) denote CenterNet’s Gaussian heatmap probabilities of the top five keypoints. The IOU threshold is 0.50, the same as the PASCAL VOC criterion [79].
Figure 19. Different histogram equalization shadow enhancements. (a) Shadows in the raw video SAR image; (b) shadows enhanced by HESE; (c) shadows enhanced by AHESE.
Figure 20. Qualitative video SAR moving target shadow detection results of ShadowDeNet on the CASIC 23 research institute data. The ground truths are marked by green boxes. The numbers above the boxes are the confidences. The IOU threshold is 0.50, the same as the PASCAL VOC criterion [79].
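HESE stretches the SAR image histogram so that dark target shadows become more separable from the clutter before feature extraction. The snippet below shows plain global histogram equalization as an assumed approximation of that preprocessing step; the paper’s exact variant (and the AHESE alternative in Figure 19) may differ.

```python
# Assumed approximation of histogram-equalization shadow enhancement on a SAR frame.
import cv2
import numpy as np

def enhance_shadows(sar_frame_gray):
    # Global histogram equalization spreads the intensity range,
    # which increases contrast and makes low-intensity shadows stand out.
    return cv2.equalizeHist(sar_frame_gray)

frame = np.random.randint(0, 256, (256, 256), dtype=np.uint8)  # stand-in for a video SAR frame
enhanced = enhance_shadows(frame)
print(frame.std(), enhanced.std())   # equalization typically increases global contrast
```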
19 pages, 4031 KiB  
Article
Saliency Detection with Moving Camera via Background Model Completion
by Yu-Pei Zhang and Kwok-Leung Chan
Sensors 2021, 21(24), 8374; https://doi.org/10.3390/s21248374 - 15 Dec 2021
Cited by 2 | Viewed by 2593
Abstract
Detecting saliency in videos is a fundamental step in many computer vision systems. Saliency is the significant target(s) in the video, and the object of interest is further analyzed for high-level applications. Saliency and the background can be segregated if they exhibit different visual cues, so saliency detection is often formulated as background subtraction. However, saliency detection is challenging. For instance, a dynamic background can result in false positive errors, while camouflage will result in false negative errors. With moving cameras, the captured scenes are even more complicated to handle. We propose a new framework, called saliency detection via background model completion (SD-BMC), that comprises a background modeler and a deep learning background/foreground segmentation network. The background modeler generates an initial clean background image from a short image sequence. Based on the idea of video completion, a good background frame can be synthesized even with the co-existence of a changing background and moving objects. We adopt a background/foreground segmenter that was pre-trained on a specific video dataset; it can also detect saliency in unseen videos. The background modeler can adjust the background image dynamically when the output of the background/foreground segmenter deteriorates while processing a long video. To the best of our knowledge, our framework is the first to adopt video completion for background modeling and saliency detection in videos captured by moving cameras. The F-measure results obtained on the pan-tilt-zoom (PTZ) videos show that our proposed framework outperforms some deep learning-based background subtraction models by 11% or more. On more challenging videos, our framework also outperforms many high-ranking background subtraction methods by more than 3%.
(This article belongs to the Section Sensing and Imaging)
Show Figures

Figure 1
<p>Overview of the saliency detection framework: (<b>a</b>) background model initialization; (<b>b</b>) continuous saliency detection.</p>
Figure 2
<p>Video completion-based background modeler.</p>
Figure 3
<p>Visual results of background modeling: (<b>a</b>) original frame; (<b>b</b>) 30 initialization frames; and (<b>c</b>) 100 initialization frames.</p>
Figure 4
<p>Structure of foreground segmenter.</p>
Figure 5
<p>Visual results of BSUV-Net 2.0 and SD-BMC on CDNet 2014.</p>
Figure 6
<p>Visual results on customized dataset.</p>
Figure 7
<p>Comparison of background frames used in BSUV-Net 2.0, PAWCS, SuBSENSE, and SD-BMC on PTZ video.</p>
14 pages, 2603 KiB  
Article
B-Line Detection and Localization in Lung Ultrasound Videos Using Spatiotemporal Attention
by Hamideh Kerdegari, Nhat Tran Huy Phung, Angela McBride, Luigi Pisani, Hao Van Nguyen, Thuy Bich Duong, Reza Razavi, Louise Thwaites, Sophie Yacoub, Alberto Gomez and VITAL Consortium
Appl. Sci. 2021, 11(24), 11697; https://doi.org/10.3390/app112411697 - 9 Dec 2021
Cited by 13 | Viewed by 3535
Abstract
The presence of B-line artefacts, the main artefact reflecting lung abnormalities in dengue patients, is often assessed using lung ultrasound (LUS) imaging. Inspired by human visual attention that enables us to process videos efficiently by paying attention to where and when it is required, we propose a spatiotemporal attention mechanism for B-line detection in LUS videos. The spatial attention allows the model to focus on the most task relevant parts of the image by learning a saliency map. The temporal attention generates an attention score for each attended frame to identify the most relevant frames from an input video. Our model not only identifies videos where B-lines show, but also localizes, within those videos, B-line related features both spatially and temporally, despite being trained in a weakly-supervised manner. We evaluate our approach on a LUS video dataset collected from severe dengue patients in a resource-limited hospital, assessing the B-line detection rate and the model’s ability to localize discriminative B-line regions spatially and B-line frames temporally. Experimental results demonstrate the efficacy of our approach for classifying B-line videos with an F1 score of up to 83.2% and localizing the most salient B-line regions both spatially and temporally with a correlation coefficient of 0.67 and an IoU of 69.7%, respectively. Full article
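A minimal PyTorch-style sketch of the kind of spatiotemporal attention pipeline this abstract describes is given below: per-frame CNN features are reweighted by a learned spatial mask, pooled, passed through a bidirectional LSTM, and pooled over time with softmax temporal attention scores. Layer sizes, channel counts, and module names are illustrative assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalAttentionClassifier(nn.Module):
    def __init__(self, feat_ch=32, hidden=64, num_classes=2):
        super().__init__()
        # Per-frame spatial feature extractor (stand-in CNN).
        self.backbone = nn.Sequential(
            nn.Conv2d(1, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Spatial attention: a 1-channel mask in [0, 1] over the feature map.
        self.mask_head = nn.Conv2d(feat_ch, 1, 1)
        self.rnn = nn.LSTM(feat_ch, hidden, batch_first=True, bidirectional=True)
        # Temporal attention: one score per frame, softmax-normalized.
        self.temporal_score = nn.Linear(2 * hidden, 1)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, clip):                        # clip: (B, T, 1, H, W)
        b, t = clip.shape[:2]
        x = self.backbone(clip.flatten(0, 1))       # (B*T, C, h, w)
        mask = torch.sigmoid(self.mask_head(x))     # (B*T, 1, h, w) spatial attention
        x = x * mask                                # element-wise reweighting
        x = F.adaptive_avg_pool2d(x, 1).flatten(1)  # global average pooling -> (B*T, C)
        seq, _ = self.rnn(x.reshape(b, t, -1))      # (B, T, 2*hidden)
        alpha = F.softmax(self.temporal_score(seq), dim=1)  # (B, T, 1) frame weights
        pooled = (alpha * seq).sum(dim=1)           # attention-weighted video feature
        return self.classifier(pooled), mask.reshape(b, t, *mask.shape[1:]), alpha

model = SpatioTemporalAttentionClassifier()
logits, spatial_maps, frame_weights = model(torch.randn(2, 8, 1, 64, 64))
print(logits.shape, spatial_maps.shape, frame_weights.shape)
```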
(This article belongs to the Special Issue Computational Ultrasound Imaging and Applications)
Show Figures

Figure 1
<p>Sample LUS images. (<b>Left</b>): a healthy lung containing several A-line artefacts; (<b>Right</b>): a dengue patient’s lung showing a B-line artefact as a result of fluid leakage into the lung.</p>
Figure 2
<p>The proposed architecture for LUS B-line detection and spatiotemporal localization. This model consists of a spatial feature extraction module (CNN layers), followed by a spatial attention network, then a bidirectional LSTM, and a temporal attention module. The parameters of each layer and module are detailed in the text.</p>
Figure 3
<p>The CNN architecture of the proposed model. It consists of four convolution layers with ReLU activation functions; max pooling follows the second and fourth convolution layers.</p>
Figure 4
<p>Spatial attention module. Several layers of convolutional networks (for details see <a href="#applsci-11-11697-t001" class="html-table">Table 1</a>) are used to learn the importance mask <math display="inline"><semantics> <msub> <mi>M</mi> <mi>i</mi> </msub> </semantics></math> for the input image feature <math display="inline"><semantics> <msub> <mi>X</mi> <mi>i</mi> </msub> </semantics></math>, the output is the element-wise multiplication <math display="inline"><semantics> <mrow> <mover accent="true"> <msub> <mi>X</mi> <mi>i</mi> </msub> <mo>˜</mo> </mover> <mo>=</mo> <msub> <mi>X</mi> <mi>i</mi> </msub> <mo>⊙</mo> <msub> <mi>M</mi> <mi>i</mi> </msub> </mrow> </semantics></math>. GAP: Global Average Pooling.</p>
Figure 5
<p>Examples of B-line regions annotation used for spatial attention task. A straight yellow line was drawn on the B-line region extending from the surface of the lung distally following the direction of propagation of the sound waves.</p>
Figure 6
<p>Examples of spatial attention map for B-line localization task. Our spatial attention module can automatically highlight B-line regions (red areas) and avoid irrelevant regions corresponding to no-B-line regions or background. Yellow straight lines represent ground truth. Correlation coefficient values (<span class="html-italic">r</span>) are presented at the bottom of each attention map.</p>
Figure 7
<p><b>Top</b>: An example of polar coordinates applied to a sample B-line frame; the red cross shows the beam source. <b>Center</b>: The generated 1-dimensional diagram showing its related ground truth (green line; the black line is a normal distribution). <b>Bottom</b>: attention map values (red line) across the coordinates. In this example, the correlation coefficient value is <span class="html-italic">r</span> = 0.71.</p>
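The caption above compares a 1-D attention profile against a Gaussian-shaped ground-truth profile along the polar coordinate using the correlation coefficient r. A minimal sketch of that comparison follows; the profile length, the Gaussian parameters, and the numpy-based Pearson computation are illustrative assumptions.

```python
import numpy as np

def pearson_r(a, b):
    # Pearson correlation coefficient between two 1-D profiles.
    return float(np.corrcoef(a, b)[0, 1])

angles = np.linspace(0.0, 1.0, 128)                         # normalized polar coordinate
ground_truth = np.exp(-0.5 * ((angles - 0.40) / 0.05) ** 2) # Gaussian around the annotated B-line
attention = np.exp(-0.5 * ((angles - 0.42) / 0.07) ** 2)    # attention profile (slightly offset)
print(f"r = {pearson_r(attention, ground_truth):.2f}")
```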
Figure 8
<p>Generated temporal (horizontal axis) and spatial (heatmap overlaid onto B-Mode B-line frames) attentions estimated by our model on an example of an LUS video that includes both B-line and non-B-line frames. The top graph shows the temporal attention weights (in blue) and the corresponding ground truth annotations (in green). Spatial attention maps are visualized for B-line frames (for example, frames 16 and 22): the yellow lines show the manual B-line annotations, and the correlation coefficient values (<span class="html-italic">r</span>), computed as described in the text, are presented at the bottom of each frame for illustration.</p>
29 pages, 1759 KiB  
Article
Multimodal Emotion Recognition on RAVDESS Dataset Using Transfer Learning
by Cristina Luna-Jiménez, David Griol, Zoraida Callejas, Ricardo Kleinlein, Juan M. Montero and Fernando Fernández-Martínez
Sensors 2021, 21(22), 7665; https://doi.org/10.3390/s21227665 - 18 Nov 2021
Cited by 76 | Viewed by 10998
Abstract
Emotion Recognition is attracting the attention of the research community due to the multiple areas where it can be applied, such as in healthcare or in road safety systems. In this paper, we propose a multimodal emotion recognition system that relies on speech and facial information. For the speech-based modality, we evaluated several transfer-learning techniques, more specifically, embedding extraction and Fine-Tuning. The best accuracy results were achieved when we fine-tuned the CNN-14 of the PANNs framework, confirming that the training was more robust when it did not start from scratch and the tasks were similar. Regarding the facial emotion recognizers, we propose a framework that consists of a pre-trained Spatial Transformer Network on saliency maps and facial images followed by a bi-LSTM with an attention mechanism. The error analysis reported that the frame-based systems could present some problems when they were used directly to solve a video-based task despite the domain adaptation, which opens a new line of research to discover new ways to correct this mismatch and take advantage of the embedded knowledge of these pre-trained models. Finally, from the combination of these two modalities with a late fusion strategy, we achieved 80.08% accuracy on the RAVDESS dataset on a subject-wise 5-CV evaluation, classifying eight emotions. The results revealed that these modalities carry relevant information to detect users’ emotional state and their combination enables improvement of system performance. Full article
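As a rough sketch of the late-fusion idea in this abstract, the example below concatenates per-modality class scores (speech and face), trains a LinearSVC on the fused vector, and evaluates with subject-wise cross-validation. The synthetic scores, the GroupKFold setup, and the hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_videos, n_classes = 200, 8
# Stand-ins for per-modality class scores (e.g., softmax outputs) per video.
speech_scores = rng.random((n_videos, n_classes))
face_scores = rng.random((n_videos, n_classes))
labels = rng.integers(0, n_classes, n_videos)
subjects = rng.integers(0, 24, n_videos)           # actor/subject id per video

fused = np.hstack([speech_scores, face_scores])    # late fusion: concatenate modality scores
accs = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(fused, labels, groups=subjects):
    clf = LinearSVC(C=1.0, max_iter=10000)
    clf.fit(fused[train_idx], labels[train_idx])
    accs.append(clf.score(fused[test_idx], labels[test_idx]))
print(f"subject-wise 5-CV accuracy: {np.mean(accs):.3f}")
```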
(This article belongs to the Special Issue Multimodal Emotion Recognition in Artificial Intelligence)
Show Figures

Figure 1
<p>Block diagram of the implemented systems.</p>
Figure 2
<p>Proposed pipelines for speech emotion recognition.</p>
Figure 3
<p>Spatial Transformer CNN architecture with visual saliency-based masks.</p>
Figure 4
<p>Bidirectional-LSTM with attention mechanism for facial emotion recognition at the video level. Modified version from source [<a href="#B77-sensors-21-07665" class="html-bibr">77</a>].</p>
Figure 5
<p>Average confusion matrix of the fine-tuned CNN-14 experiment with an accuracy of 76.58%.</p>
Figure 6
<p>Average confusion matrix of the bi-LSTM with two layers of 300 neurons and two attention layers trained with the embeddings extracted from the flattened-810 layer of the fine-tuned STN. Accuracy of 57.08%. See <a href="#sensors-21-07665-t003" class="html-table">Table 3</a>.</p>
Figure 7
<p>The top average accuracy of the 5-CV obtained for speech and visual modalities with a 95% confidence interval. In orange, the experiments with the original videos; in blue, the samples with speech; in green, the mix of the top modalities: the speech model without VAD and the visual model with VAD.</p>
Figure 8
<p>Average confusion matrix of the top late fusion strategy using a LinearSVC combining the top results of SER for the version without VAD and the FER for the version with VAD. Accuracy of 80.08%. See <a href="#sensors-21-07665-t0A3" class="html-table">Table A3</a>.</p>
Figure A1
<p>Example of frames from a video tagged as ‘Calm’ with some samples predicted as ‘Happy’. The whole video was correctly predicted as ‘Calm’.</p>
Figure A2
<p>Example of frames from a video tagged as ‘Surprised’ incorrectly predicted as ‘Happy’. The whole video was incorrectly predicted as ‘Happy’.</p>
Figure A3
<p>Example of frames from a video tagged as ‘Sad’ incorrectly predicted as ‘Fearful’. The whole video was incorrectly predicted as ‘Fearful’.</p>
18 pages, 9636 KiB  
Article
Video Desnowing and Deraining via Saliency and Dual Adaptive Spatiotemporal Filtering
by Yongji Li, Rui Wu, Zhenhong Jia, Jie Yang and Nikola Kasabov
Sensors 2021, 21(22), 7610; https://doi.org/10.3390/s21227610 - 16 Nov 2021
Cited by 5 | Viewed by 2300
Abstract
Outdoor vision sensing systems often struggle with poor weather conditions, such as snow and rain, which pose a great challenge to existing video desnowing and deraining methods. In this paper, we propose a novel video desnowing and deraining model that utilizes the saliency information of moving objects to address this problem. First, we remove the snow and rain from the video by low-rank tensor decomposition, which makes full use of the spatial location information and the correlation between the three channels of the color video. Second, because existing algorithms often regard sparse snowflakes and rain streaks as moving objects, this paper injects saliency information into moving object detection, which reduces false alarms and missed detections of moving objects. At the same time, feature point matching is used to mine the redundant information of moving objects in consecutive frames, and we propose a dual adaptive minimum filtering algorithm in the spatiotemporal domain to remove snow and rain in front of moving objects. Both qualitative and quantitative experimental results show that the proposed algorithm is more competitive than other state-of-the-art snow and rain removal methods. Full article
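A minimal sketch of the low-rank background idea in this abstract is shown below: frames are stacked as columns of a matrix and a truncated SVD yields a low-rank background, while the residual collects snowflakes, rain streaks, and moving objects. The rank-1 truncation, the single-channel toy data, and the matrix (rather than tensor) formulation are simplifications assumed for illustration; the paper's actual method uses low-rank tensor decomposition over the three color channels plus the saliency-guided filtering steps described above.

```python
import numpy as np

def low_rank_background(frames, rank=1):
    """Approximate a static background from a stack of frames (T, H, W)
    by a truncated SVD of the (H*W, T) matrix whose columns are frames."""
    t, h, w = frames.shape
    matrix = frames.reshape(t, h * w).T.astype(float)   # pixels x time
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank, :]  # rank-r approximation
    background = low_rank.mean(axis=1).reshape(h, w)    # collapse time
    sparse = matrix - low_rank                          # snow/rain + moving objects
    return background, sparse.T.reshape(t, h, w)

# Toy example: static ramp background corrupted by sparse bright "snow" pixels.
t, h, w = 20, 40, 60
frames = np.tile(np.linspace(0, 200, h * w).reshape(h, w), (t, 1, 1))
snow = (np.random.rand(t, h, w) < 0.01) * 255.0
background, residual = low_rank_background(frames + snow)
print(background.shape, np.abs(residual).max() > 0)
```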
(This article belongs to the Section Sensing and Imaging)
Show Figures

Figure 1
<p>The flow diagram of our proposed algorithm.</p>
Figure 2
<p>Extracting the low-rank background (<b>b</b>) from a snow video sequence (<b>a</b>).</p>
Figure 3
<p>(<b>a</b>) The moving object matching process; (<b>b</b>) the result of dual adaptive spatiotemporal filtering; (<b>c</b>) the clean video frame obtained by pasting the desnowed moving object back into the low-rank background.</p>
Figure 4
<p>Comparison on a synthetic snow video. (<b>a</b>) Ground truth, (<b>b</b>) input, (<b>c</b>) Kim et al. [<a href="#B16-sensors-21-07610" class="html-bibr">16</a>], (<b>d</b>) Wang et al. [<a href="#B8-sensors-21-07610" class="html-bibr">8</a>], (<b>e</b>) Li et al. [<a href="#B14-sensors-21-07610" class="html-bibr">14</a>], (<b>f</b>) Chen et al. [<a href="#B32-sensors-21-07610" class="html-bibr">32</a>], (<b>g</b>) proposed method.</p>
Figure 5
<p>Comparison on a synthetic rain video. (<b>a</b>) Ground truth, (<b>b</b>) input, (<b>c</b>) Kim et al. [<a href="#B16-sensors-21-07610" class="html-bibr">16</a>], (<b>d</b>) Wang et al. [<a href="#B8-sensors-21-07610" class="html-bibr">8</a>], (<b>e</b>) Li et al. [<a href="#B14-sensors-21-07610" class="html-bibr">14</a>], (<b>f</b>) Chen et al. [<a href="#B32-sensors-21-07610" class="html-bibr">32</a>], (<b>g</b>) proposed method.</p>
Figure 6
<p>Comparison on a real snow video. (<b>a</b>) Input, (<b>b</b>) Kim et al. [<a href="#B16-sensors-21-07610" class="html-bibr">16</a>], (<b>c</b>) Wang et al. [<a href="#B8-sensors-21-07610" class="html-bibr">8</a>], (<b>d</b>) Li et al. [<a href="#B14-sensors-21-07610" class="html-bibr">14</a>], (<b>e</b>) Chen et al. [<a href="#B32-sensors-21-07610" class="html-bibr">32</a>], (<b>f</b>) proposed method.</p>
Figure 7
<p>Comparison on a real rain video. (<b>a</b>) Input, (<b>b</b>) Kim et al. [<a href="#B16-sensors-21-07610" class="html-bibr">16</a>], (<b>c</b>) Wang et al. [<a href="#B8-sensors-21-07610" class="html-bibr">8</a>], (<b>d</b>) Li et al. [<a href="#B14-sensors-21-07610" class="html-bibr">14</a>], (<b>e</b>) Chen et al. [<a href="#B32-sensors-21-07610" class="html-bibr">32</a>], (<b>f</b>) proposed method.</p>
Figure 8
<p>Comparison on a real snow video. (<b>a</b>) Input, (<b>b</b>) Kim et al. [<a href="#B16-sensors-21-07610" class="html-bibr">16</a>], (<b>c</b>) Wang et al. [<a href="#B8-sensors-21-07610" class="html-bibr">8</a>], (<b>d</b>) Li et al. [<a href="#B14-sensors-21-07610" class="html-bibr">14</a>], (<b>e</b>) Chen et al. [<a href="#B32-sensors-21-07610" class="html-bibr">32</a>], (<b>f</b>) proposed method.</p>
Figure 9
<p>Comparison on a real rain video. (<b>a</b>) Input, (<b>b</b>) Kim et al. [<a href="#B16-sensors-21-07610" class="html-bibr">16</a>], (<b>c</b>) Wang et al. [<a href="#B8-sensors-21-07610" class="html-bibr">8</a>], (<b>d</b>) Li et al. [<a href="#B14-sensors-21-07610" class="html-bibr">14</a>], (<b>e</b>) Chen et al. [<a href="#B32-sensors-21-07610" class="html-bibr">32</a>], (<b>f</b>) proposed method.</p>
Figure 10
<p>Comparison on a real snow video. (<b>a</b>) Input, (<b>b</b>) Kim et al. [<a href="#B16-sensors-21-07610" class="html-bibr">16</a>], (<b>c</b>) Wang et al. [<a href="#B8-sensors-21-07610" class="html-bibr">8</a>], (<b>d</b>) Li et al. [<a href="#B14-sensors-21-07610" class="html-bibr">14</a>], (<b>e</b>) Chen et al. [<a href="#B32-sensors-21-07610" class="html-bibr">32</a>], (<b>f</b>) proposed method.</p>
Figure 11
<p>Comparison on a real rain video. (<b>a</b>) Input, (<b>b</b>) Kim et al. [<a href="#B16-sensors-21-07610" class="html-bibr">16</a>], (<b>c</b>) Wang et al. [<a href="#B8-sensors-21-07610" class="html-bibr">8</a>], (<b>d</b>) Li et al. [<a href="#B14-sensors-21-07610" class="html-bibr">14</a>], (<b>e</b>) Chen et al. [<a href="#B32-sensors-21-07610" class="html-bibr">32</a>], (<b>f</b>) proposed method.</p>
Figure 12
<p>Comparison on a real rain video. (<b>a</b>) Input, (<b>b</b>) Kim et al. [<a href="#B16-sensors-21-07610" class="html-bibr">16</a>], (<b>c</b>) Wang et al. [<a href="#B8-sensors-21-07610" class="html-bibr">8</a>], (<b>d</b>) Li et al. [<a href="#B14-sensors-21-07610" class="html-bibr">14</a>], (<b>e</b>) Chen et al. [<a href="#B32-sensors-21-07610" class="html-bibr">32</a>], (<b>f</b>) proposed method.</p>
Figure 13
<p>Runtime comparison of the compared methods on two videos. (<b>a</b>) The test video is the synthetic snow video (<a href="#sensors-21-07610-f004" class="html-fig">Figure 4</a>). (<b>b</b>) The test video is the real rain video (<a href="#sensors-21-07610-f011" class="html-fig">Figure 11</a>).</p>