Search Results (35)

Search Parameters:
Keywords = video saliency model

17 pages, 9404 KiB  
Article
SimpleTrackV2: Rethinking the Timing Characteristics for Multi-Object Tracking
by Yan Ding, Yuchen Ling, Bozhi Zhang, Jiaxin Li, Lingxi Guo and Zhe Yang
Sensors 2024, 24(18), 6015; https://doi.org/10.3390/s24186015 - 17 Sep 2024
Cited by 1 | Viewed by 1286
Abstract
Multi-object tracking tasks aim to assign unique trajectory codes to targets in video frames. Most detection-based tracking methods use Kalman filtering algorithms for trajectory prediction, directly utilizing associated target features for trajectory updates. However, this approach often fails under camera jitter and transient target loss in real-world scenarios. This paper rethinks state prediction and fusion based on target temporal features to address these issues and proposes the SimpleTrackV2 algorithm, building on the previously designed SimpleTrack. Firstly, to address the poor prediction performance of linear motion models in complex scenes, we designed a target state prediction algorithm called LSTM-MP, based on long short-term memory (LSTM). This algorithm encodes the target’s historical motion information with an LSTM and decodes it with a multilayer perceptron (MLP) to predict the target state. Secondly, to mitigate the effect of occlusion on target state saliency, we designed a spatiotemporal attention-based target appearance feature fusion (TSA-FF) algorithm for target state fusion. TSA-FF calculates adaptive fusion coefficients to enhance target state fusion, thereby improving the accuracy of subsequent data association. To demonstrate the effectiveness of the proposed method, we compared SimpleTrackV2 with the baseline model SimpleTrack on the MOT17 dataset. We also conducted ablation experiments on TSA-FF and LSTM-MP within SimpleTrackV2, exploring the optimal number of fusion frames and the impact of different loss functions on model performance. The experimental results show that SimpleTrackV2 handles camera jitter and target occlusion better, achieving improvements of 1.6%, 3.2%, and 6.1% in MOTA, IDF1, and HOTA, respectively, compared to the SimpleTrack algorithm.
(This article belongs to the Section Sensing and Imaging)
Show Figures

Figure 1. Motion state differences and linear Kalman filter predictions for the pedestrian target with video sequence ID 23 in MOT17-09. The linear Kalman filter predicts the center point coordinates and the width and height of the target’s bounding box based on a linear motion model. The black curve represents the frame-to-frame differences in the Kalman filter’s predictions; the blue and green curves show the differences in the horizontal and vertical coordinates of the target’s center point, respectively; the purple and orange curves illustrate the differences in the target’s width and height; and the red and cyan curves depict the intersection over union (IoU) of the predicted bounding boxes and the degree of target occlusion.
Figure 2. The target state prediction and fusion pipeline in SimpleTrackV2. SimpleTrackV2 inherits the feature decoupling and association components from SimpleTrack, with the improvements highlighted in the red boxes, which primarily include the design of LSTM-MP for target state prediction and TSA-FF for target state fusion. In the diagram, the blue output lines from feature decoupling represent the extracted positional features, while the purple arrows indicate the extracted appearance features. The LSTM-MP model utilizes the differences in the target’s motion state between consecutive frames as input to predict future motion states. The TSA-FF algorithm computes temporal and spatial attention weights to determine fusion weight coefficients, facilitating the effective fusion of target states.
Figure 3. The LSTM-MP model begins by applying a feature transformation that converts the motion features of the target’s historical frames into the differences between consecutive frame motion states, which are then used as input to the model. The LSTM-MP model encodes these features and reduces their dimensionality through a multilayer perceptron (MLP) network. Finally, an inverse feature transformation is applied to predict the target’s future motion state.
Figure 4. The overall structure of the LSTM-MP model.
Figure 5. The TSA-FF algorithm first computes the temporal and spatial attention weights for the appearance feature vectors of different time frames for the same target. It then sums the self-spatial attention coefficients of each historical frame with the interaction spatial attention coefficients and multiplies this sum by the temporal attention weights to determine the final fusion coefficients. Finally, the appearance features of each historical frame are weighted and averaged using these fusion coefficients, followed by normalization to obtain the fused target appearance features.
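As described in the abstract and in Figures 3 and 4, LSTM-MP encodes a short history of frame-to-frame motion-state differences with an LSTM and decodes the next difference with an MLP. The sketch below is a hypothetical PyTorch reconstruction of that idea only; the layer sizes, history length, and state layout (center x/y, width, height) are assumptions, not the authors’ published configuration.

```python
# Minimal sketch of an LSTM + MLP motion-state predictor (assumed state layout: [cx, cy, w, h]).
import torch
import torch.nn as nn

class LSTMMotionPredictor(nn.Module):
    def __init__(self, state_dim=4, hidden_dim=64):
        super().__init__()
        # Encode the sequence of frame-to-frame state differences.
        self.encoder = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        # Decode the last hidden state into the predicted next difference.
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, states):                  # states: (B, T, 4) absolute box states
        diffs = states[:, 1:] - states[:, :-1]  # feature transform: consecutive differences
        _, (h_n, _) = self.encoder(diffs)
        next_diff = self.decoder(h_n[-1])       # predicted next-frame difference
        return states[:, -1] + next_diff        # inverse transform: add back to last state

# Toy usage: predict the next box state from 8 observed frames.
track = torch.randn(2, 8, 4)
print(LSTMMotionPredictor()(track).shape)       # torch.Size([2, 4])
```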
19 pages, 9912 KiB  
Article
Research on VVC Intra-Frame Bit Allocation Scheme Based on Significance Detection
by Xuesong Jin, Huiyuan Sun and Yuhang Zhang
Appl. Sci. 2024, 14(1), 471; https://doi.org/10.3390/app14010471 - 4 Jan 2024
Cited by 1 | Viewed by 2999
Abstract
This research addresses the intra-frame rate control algorithm of the Versatile Video Coding (VVC) standard, where the bit allocation process tends to over-allocate bitrate to the last coding tree units (CTUs) in a frame while the earlier CTUs are not effectively compressed. Fusing a Canny-based edge detection algorithm, a color-contrast-based saliency detection algorithm, a Sum of Absolute Transformed Differences (SATD)-based CTU coding complexity measure, and a Partial Least Squares (PLS) regression model, this paper proposes a CTU-level bit allocation improvement scheme for intra-mode rate control in the VVC standard. First, natural images are selected to produce a lightweight dataset. Second, different metrics are used to obtain the significance and complexity values of each coding unit; the relatively important coding units in the whole frame are selected and adjusted with different weights, and the optimal adjustment multiplicity is added to the dataset. Finally, the PLS regression model is used to obtain regression equations that refine the weights for adjusting the bit allocation. The proposed bit allocation scheme improves the average rate control accuracy by 0.453%, Y-PSNR by 0.05 dB, BD-rate savings by 0.33%, and BD-PSNR by 0.03 dB compared to the VVC standard rate control algorithm.
Show Figures

Figure 1. Development process of video coding standards in various series.
Figure 2. The bpp of all CTUs in the first frame in intra-frame mode under the default rate control algorithm. (a) BQMall; (b) Cactus; (c) BQSquare; (d) FourPeople.
Figure 3. The overall flow of the proposed algorithm.
Figure 4. Significance information extraction, using mv5 as an example. (a) Original image; (b) grayscale image; (c) significance map based on the Canny algorithm; (d) significance map based on the color contrast algorithm.
Figure 5. Feature information of mv5 as an example. (a) Coding complexity based on SATD; (b) significance value based on the Canny algorithm; (c) significance value based on the color contrast algorithm.
Figure 6. Relationship between multiplicity and rate control efficiency for the example of mv5.
Figure 7. Regression model analysis for the improved CTUs.
Figure 8. Bit tuning for a test sequence with a set maximum bit cost. (a) PeopleOnStreet; (b) BasketballDrive; (c) BQMall; (d) RaceHorses; (e) FourPeople; (f) BasketballDrillText.
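The scheme combines per-CTU significance (Canny edges, color contrast) with an SATD complexity measure to re-weight the CTU-level bit budget. The snippet below is a rough illustration of that idea using only a Canny edge density and block variance as stand-ins; the 128×128 CTU size, the weighting formula, and the normalization are assumptions for illustration, not the paper’s calibrated PLS-based model.

```python
# Rough sketch: re-weight a per-CTU bit budget by edge density and block variance.
import cv2
import numpy as np

def ctu_bit_weights(frame_gray, ctu=128):
    h, w = frame_gray.shape
    edges = cv2.Canny(frame_gray, 100, 200)
    scores = []
    for y in range(0, h - ctu + 1, ctu):
        for x in range(0, w - ctu + 1, ctu):
            block = frame_gray[y:y + ctu, x:x + ctu].astype(np.float64)
            edge_density = edges[y:y + ctu, x:x + ctu].mean() / 255.0   # crude "significance"
            complexity = block.var() / 255.0 ** 2                        # stand-in for SATD
            scores.append(1.0 + 0.5 * edge_density + 0.5 * complexity)   # assumed weighting
    scores = np.array(scores)
    return scores / scores.sum()              # fraction of the frame budget per CTU

frame = np.random.randint(0, 256, (256, 384), dtype=np.uint8)
weights = ctu_bit_weights(frame)
print(weights.round(3), weights.sum())        # per-CTU shares summing to 1
```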
21 pages, 4424 KiB  
Article
A Salient Object Detection Method Based on Boundary Enhancement
by Falin Wen, Qinghui Wang, Ruirui Zou, Ying Wang, Fenglin Liu, Yang Chen, Linghao Yu, Shaoyi Du and Chengzhi Yuan
Sensors 2023, 23(16), 7077; https://doi.org/10.3390/s23167077 - 10 Aug 2023
Viewed by 1556
Abstract
Visual saliency refers to the human ability to quickly focus on important parts of the visual field, which is a crucial aspect of image processing, particularly in fields like medical imaging and robotics. Understanding and simulating this mechanism is essential for solving complex visual problems. In this paper, we propose a salient object detection method based on boundary enhancement, which is applicable to both 2D and 3D sensor data. To address the problem of large-scale variation of salient objects, our method introduces a multi-level feature aggregation module that enhances the expressive ability of fixed-resolution features by utilizing adjacent features to complement each other. Additionally, we propose a multi-scale information extraction module to capture local contextual information at different scales for the back-propagated level-by-level features, which allows for a better measurement of the composition of the feature map after back-fusion. To tackle the low confidence of boundary pixels, we also introduce a boundary extraction module to extract the boundary information of salient regions. This information is then fused with the salient target information to further refine the saliency prediction results. During training, our method uses a mixed loss function to constrain the model at two levels: pixels and images. The experimental results demonstrate that our boundary-enhanced salient target detection method performs well on targets of different scales, multiple targets, linear targets, and targets in complex scenes. Compared with the best-performing methods on four conventional datasets, we achieve an average improvement of 6.2% on the mean absolute error (MAE) indicator. Overall, our approach shows promise for improving the accuracy and efficiency of salient object detection in a variety of settings, including those involving 2D/3D semantic analysis and the reconstruction/inpainting of image/video/point cloud data.
(This article belongs to the Special Issue Machine Learning Based 2D/3D Sensors Data Understanding and Analysis)
Show Figures

Figure 1. The overall framework of the salient target detection method based on boundary enhancement.
Figure 2. Detailed network structure diagram, taking ResNet-50 as an example.
Figure 3. Structure diagram of the multi-level feature aggregation module.
Figure 4. Multi-scale information extraction module.
Figure 5. Boundary extraction module.
Figure 6. A visualization of the ablation experiment. From left to right: the original image, the ground-truth image, the baseline result, and the result of this study.
Figure 7. PR curves and F-measure curves.
Figure 8. Visual comparison for the comparative experiment. The left two columns display the original image and the ground-truth image, while the remaining columns show the results of the proposed method and other methods.
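The abstract mentions a mixed loss that constrains training at both the pixel level and the image level. A common way to realize this, shown below as an assumed illustration rather than the authors’ exact formulation, is to combine per-pixel binary cross-entropy with an image-level soft-IoU term.

```python
# Sketch of a mixed pixel + image level saliency loss (BCE + soft IoU); assumed formulation.
import torch
import torch.nn.functional as F

def mixed_saliency_loss(logits, target, eps=1e-6):
    # Pixel level: binary cross-entropy on every pixel.
    bce = F.binary_cross_entropy_with_logits(logits, target)
    # Image level: soft IoU between the predicted map and the ground truth.
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = (prob + target - prob * target).sum(dim=(1, 2, 3))
    iou = 1.0 - (inter + eps) / (union + eps)
    return bce + iou.mean()

pred = torch.randn(2, 1, 64, 64)                 # raw saliency logits
gt = (torch.rand(2, 1, 64, 64) > 0.5).float()    # binary ground-truth masks
print(mixed_saliency_loss(pred, gt).item())
```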
22 pages, 2661 KiB  
Article
Human Action Representation Learning Using an Attention-Driven Residual 3DCNN Network
by Hayat Ullah and Arslan Munir
Algorithms 2023, 16(8), 369; https://doi.org/10.3390/a16080369 - 31 Jul 2023
Cited by 2 | Viewed by 1674
Abstract
The recognition of human activities using vision-based techniques has become a crucial research field in video analytics. Over the last decade, there have been numerous advancements in deep learning algorithms aimed at accurately detecting complex human actions in video streams. While these algorithms have demonstrated impressive performance in activity recognition, they often favor either model performance or computational efficiency. This biased trade-off between robustness and efficiency poses challenges when addressing complex human activity recognition problems. To address this issue, this paper presents a computationally efficient yet robust approach that exploits saliency-aware spatial and temporal features for human action recognition in videos. To achieve an effective representation of human actions, we propose the dual-attentional Residual 3D Convolutional Neural Network (DA-R3DCNN). Our method utilizes a unified channel-spatial attention mechanism, allowing it to efficiently extract significant human-centric features from video frames. By combining dual channel-spatial attention layers with residual 3D convolution layers, the network becomes more discerning in capturing spatial receptive fields containing objects within the feature maps. To assess the effectiveness and robustness of the proposed method, we conducted extensive experiments on four well-established benchmark datasets for human action recognition. The quantitative results validate the efficiency of our method, showing accuracy improvements of up to 11% compared to state-of-the-art human action recognition methods. Additionally, our evaluation of inference time reveals that the proposed method achieves up to a 74× improvement in frames per second (FPS) compared to existing approaches, demonstrating the suitability of the proposed DA-R3DCNN for real-time human activity recognition.
(This article belongs to the Special Issue Algorithms for Image Processing and Machine Vision)
Show Figures

Figure 1. The graphical abstract of our proposed DA-R3DCNN network architecture.
Figure 2. A visual overview of the 3D residual convolutional block used in this study, with a convolutional shortcut path.
Figure 3. A visual overview of the dual channel-spatial attention module.
Figure 4. Architecture of the dual channel-spatial attention module.
Figure 5. Confusion matrices computed for the proposed DA-R3DCNN framework on the test sets of the four experimental datasets: (a) UCF11, (b) UCF50, (c) HMDB51, and (d) UCF101.
Figure 6. A graphical overview of the comparative analysis of our proposed DA-R3DCNN against state-of-the-art methods on the (a) UCF11, (b) UCF50, (c) HMDB51, and (d) UCF101 datasets.
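DA-R3DCNN combines residual 3D convolutions with a unified channel-spatial attention mechanism. The block below sketches one plausible form of such attention for 5-D video feature maps (batch, channels, time, height, width); the squeeze ratio and the exact attention layout are assumptions, not the published architecture.

```python
# Sketch of a channel-spatial attention block for 3D (video) feature maps; layout assumed.
import torch
import torch.nn as nn

class ChannelSpatialAttention3D(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        # Channel attention: squeeze over (T, H, W), excite per channel.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )
        # Spatial attention: a 1-channel mask over (T, H, W).
        self.spatial = nn.Sequential(nn.Conv3d(channels, 1, kernel_size=3, padding=1),
                                     nn.Sigmoid())

    def forward(self, x):                          # x: (B, C, T, H, W)
        c = self.channel_mlp(x).view(x.size(0), -1, 1, 1, 1)
        x = x * c                                  # re-weight channels
        return x * self.spatial(x)                 # re-weight spatiotemporal positions

feat = torch.randn(2, 32, 8, 28, 28)
print(ChannelSpatialAttention3D(32)(feat).shape)   # torch.Size([2, 32, 8, 28, 28])
```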
20 pages, 2466 KiB  
Article
SDebrisNet: A Spatial–Temporal Saliency Network for Space Debris Detection
by Jiang Tao, Yunfeng Cao and Meng Ding
Appl. Sci. 2023, 13(8), 4955; https://doi.org/10.3390/app13084955 - 14 Apr 2023
Cited by 10 | Viewed by 3327
Abstract
The rapidly growing number of space activities is generating a large amount of space debris, which greatly threatens the safety of space operations. Therefore, space-based space debris surveillance is crucial for the early avoidance of spacecraft emergencies. With progress in computer vision technology, space debris detection using optical sensors has become a promising solution. However, detecting space debris at far ranges is challenging due to its limited imaging size and unknown movement characteristics. In this paper, we propose a space debris saliency detection algorithm called SDebrisNet. The algorithm utilizes a convolutional neural network (CNN) to take into account both spatial and temporal data from sequential video images, aiming to assist in detecting small and moving space debris. Firstly, taking into account the limited resources of space-based computational platforms, a MobileNet-based space debris feature extraction structure was constructed to make the overall model more lightweight. In particular, an enhanced spatial feature module is introduced to strengthen the spatial details of small objects. Secondly, based on attention mechanisms, a constrained self-attention (CSA) module is applied to learn the spatiotemporal data from the sequential images. Finally, a space debris dataset was constructed for algorithm evaluation. The experimental results demonstrate that the proposed method is robust for detecting moving space debris with a low signal-to-noise ratio in video. Compared to the NODAMI method, SDebrisNet shows improvements of 3.5% and 1.7% in detection probability and false alarm rate, respectively.
(This article belongs to the Special Issue Vision-Based Autonomous Unmanned Systems: Challenges and Approaches)
Show Figures

Figure 1. Flow chart of the proposed space debris detection method.
Figure 2. Detailed illustration of the proposed saliency detection network, which consists of the spatial feature extraction module, spatial feature enhancement module, temporal feature extraction module, and saliency prediction module.
Figure 3. Spatial feature extraction module. The black squares on each line denote the first five output feature maps of each block. The number at the end of each line is the number of extracted feature maps.
Figure 4. Spatial feature enhancement module.
Figure 5. Illustration of the constrained self-attention (CSA) network. The video clip includes four frames as an example in this figure.
Figure 6. The spatial positions of space debris at different motion speeds. The space debris crosses three consecutive frames at two different speeds, where v2 > v1.
Figure 7. Example of a video sequence from the SDD dataset with SNR = 0.2, diameter = 0.1 m, speed = 4 pixels/frame, and direction = 10° (top row). Close-up of the space debris marked by a green circle in the image sequence (bottom row).
Figure 8. Saliency detection results of SDebrisNet without different components; all frames are collapsed by a max operator.
Figure 9. Two real video datasets; all frames are collapsed by a max operator. Video 1 includes two pieces of space debris with linear tracks. Video 2 includes one space object with a curved track.
Figure 10. Example of the detected centroids on real video sequence 1 (top row) and video sequence 2 (bottom row). The detected centroid coordinates and missed detections are marked with green circles and red circles, respectively.
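SDebrisNet keeps the spatial backbone lightweight by building it from MobileNet-style layers. The snippet below sketches the standard depthwise-separable convolution that such backbones rely on (a depthwise 3×3 followed by a pointwise 1×1); the channel counts and normalization choices are illustrative assumptions, not SDebrisNet’s exact configuration.

```python
# Sketch of a MobileNet-style depthwise-separable convolution block (illustrative sizes).
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Depthwise 3x3: one filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                                   groups=in_ch, bias=False)
        # Pointwise 1x1: mixes channels cheaply.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))

frame_feat = torch.randn(2, 16, 128, 128)
print(DepthwiseSeparableConv(16, 32, stride=2)(frame_feat).shape)  # (2, 32, 64, 64)
```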
17 pages, 5104 KiB  
Article
Video Saliency Object Detection with Motion Quality Compensation
by Hengsen Wang, Chenglizhao Chen, Linfeng Li and Chong Peng
Electronics 2023, 12(7), 1618; https://doi.org/10.3390/electronics12071618 - 30 Mar 2023
Cited by 1 | Viewed by 1721
Abstract
Video saliency object detection is one of the classic research problems in computer vision, yet existing works rarely focus on the impact of input quality on model performance. As optical flow is a key input for video saliency detection models, its quality significantly affects model performance. Traditional optical flow models only calculate the optical flow between two consecutive video frames, ignoring the motion state of objects over a period of time, which leads to low-quality optical flow and reduced performance of video saliency object detection models. Therefore, this paper proposes a new optical flow model that improves optical flow quality by expanding the flow perception range, and uses the resulting high-quality optical flow to enhance the performance of video saliency object detection models. Experimental results show that the proposed optical flow model can significantly improve optical flow quality, with the S-M values on the DAVSOD dataset increasing by about 39%, 49%, and 44% compared to optical flow models such as PWCNet, SpyNet, and LFNet. In addition, experiments fine-tuning the benchmark model LIMS demonstrate that improving input quality can further improve model performance.
Show Figures

Figure 1. (a) Traditional optical flow models only use two frames for optical flow calculation. (b) Optical flow maps generated by traditional optical flow models.
Figure 2. (a) The new optical flow model proposed in this paper. (b) High-quality optical flow maps and RGB images are fused by a traditional feature fusion module to obtain a saliency map.
Figure 3. Color saliency maps are obtained from RGB images through the color saliency module, while motion saliency maps are obtained from optical flow maps through the motion saliency module.
Figure 4. Comparison of optical flow maps generated by different optical flow models in multiple scenes; “Ours” represents the optical flow map generated by the optical flow model proposed in this paper.
Figure 5. High-quality motion saliency maps and color saliency maps often have higher structural consistency.
Figure 6. Comparison of motion saliency maps generated by four different optical flow models, with “Ours” representing the motion saliency map generated by the optical flow model proposed in this paper.
Figure 7. Comparison of saliency maps generated by different VSOD models; “LIMS+Our” refers to the saliency map generated by fine-tuning the LIMS model using the optical flow model proposed in this paper.
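The key idea is to widen the flow perception range beyond a single frame pair so that short-term motion history stabilizes the flow. The paper’s model is learned; the snippet below only illustrates the underlying intuition with classical Farnebäck flow averaged over a small window of frame pairs, an assumed stand-in rather than the proposed network.

```python
# Illustration only: average classical Farneback flow over several frame pairs
# to mimic a wider flow perception range (not the paper's learned model).
import cv2
import numpy as np

def windowed_flow(gray_frames, window=4):
    """Average dense optical flow from the previous `window` frames to the current frame."""
    ref = gray_frames[-1]
    flows = []
    for prev in gray_frames[-1 - window:-1]:
        flow = cv2.calcOpticalFlowFarneback(prev, ref, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
    return np.mean(flows, axis=0)             # (H, W, 2) smoothed motion field

frames = [np.random.randint(0, 256, (120, 160), dtype=np.uint8) for _ in range(6)]
print(windowed_flow(frames).shape)            # (120, 160, 2)
```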
16 pages, 8732 KiB  
Article
Just Noticeable Difference Model for Images with Color Sensitivity
by Zhao Zhang, Xiwu Shang, Guoping Li and Guozhong Wang
Sensors 2023, 23(5), 2634; https://doi.org/10.3390/s23052634 - 27 Feb 2023
Cited by 1 | Viewed by 3433
Abstract
The just noticeable difference (JND) model reflects the visibility limitations of the human visual system (HVS), plays an important role in perceptual image/video processing, and is commonly applied to perceptual redundancy removal. However, existing JND models are usually constructed by treating the color components of the three channels equally, and their estimation of the masking effect is inadequate. In this paper, we introduce visual saliency and color sensitivity modulation to improve the JND model. Firstly, we comprehensively combine contrast masking, pattern masking, and edge protection to estimate the masking effect. Then, the visual saliency of the HVS is taken into account to adaptively modulate the masking effect. Finally, we build color sensitivity modulation according to the perceptual sensitivities of the HVS to adjust the sub-JND thresholds of the Y, Cb, and Cr components. Thus, the color-sensitivity-based JND model (CSJND) is constructed. Extensive experiments and subjective tests were conducted to verify the effectiveness of the CSJND model. The consistency between the CSJND model and the HVS was better than that of existing state-of-the-art JND models.
(This article belongs to the Special Issue Image/Signal Processing and Machine Vision in Sensing Applications)
Show Figures

Figure 1. The framework of the proposed CSJND model.
Figure 2. An example of JND generation and a contaminated image guided by JND noise: (a) the original image; (b) response map for contrast masking of the Y component; (c) response map for pattern masking of the Y component; (d) saliency prediction map; (e) JND map of the Y component; and (f) JND-contaminated image, with PSNR = 27.00 dB.
Figure 3. The Prewitt kernels in the vertical and horizontal directions.
Figure 4. Comparison of contaminated images from JND models based on the different proposed factors. The contaminated images have the same level of noise, with PSNR = 28.25 dB. (a) The original image. (b) The basic model JND_θ^B, VMAF = 80.10. (c) The model JND_θ^S based on the basic model and saliency modulation, VMAF = 84.42. (d) The model JND_θ^C based on the basic model and color sensitivity modulation, VMAF = 88.04. (e) The proposed model CSJND_θ, VMAF = 94.75.
Figure 5. An example comparison of contaminated images from different JND models. The contaminated images have the same level of noise, with PSNR = 28.91 dB. (a) The original image; (b) Wu2013, VMAF = 82.65; (c) Wu2017, VMAF = 83.41; (d) Chen2019, VMAF = 87.44; (e) Jiang2022, VMAF = 90.34; and (f) the proposed CSJND model, VMAF = 94.99.
Figure 6. The set of test images, in order from I1 to I12.
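A JND map is typically validated by injecting noise whose per-pixel magnitude equals the JND threshold and then checking how image quality (e.g., PSNR or VMAF) degrades. The snippet below shows that evaluation step with a toy luminance-adaptation threshold; the formula is a simplified textbook-style assumption, not the CSJND model.

```python
# Toy JND-guided noise injection: a simplified luminance-adaptation threshold
# (not the CSJND model) and random-sign noise bounded by it.
import numpy as np

def luminance_jnd(gray):
    bg = gray.astype(np.float64)
    low = 17.0 * (1.0 - np.sqrt(bg / 127.0)) + 3.0    # darker regions tolerate more noise
    high = 3.0 / 128.0 * (bg - 127.0) + 3.0
    return np.where(bg <= 127, low, high)

def inject_jnd_noise(gray):
    jnd = luminance_jnd(gray)
    signs = np.random.choice([-1.0, 1.0], size=gray.shape)
    return np.clip(gray + signs * jnd, 0, 255).astype(np.uint8)

def psnr(a, b):
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)

img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
noisy = inject_jnd_noise(img)
print(round(psnr(img, noisy), 2))   # quality after injecting JND-bounded noise
```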
20 pages, 4998 KiB  
Article
Quality-Driven Dual-Branch Feature Integration Network for Video Salient Object Detection
by Xiaofei Zhou, Hanxiao Gao, Longxuan Yu, Defu Yang and Jiyong Zhang
Electronics 2023, 12(3), 680; https://doi.org/10.3390/electronics12030680 - 29 Jan 2023
Cited by 4 | Viewed by 1673
Abstract
Video salient object detection has attracted growing interest in recent years. However, some existing video saliency models suffer from the inappropriate utilization of spatial and temporal cues and the insufficient aggregation of different-level features, leading to remarkable performance degradation. Therefore, we propose a quality-driven dual-branch feature integration network centered on the adaptive fusion of multi-modal cues and the sufficient aggregation of multi-level spatiotemporal features. Firstly, we employ the quality-driven multi-modal feature fusion (QMFF) module to combine the spatial and temporal features. In particular, the quality scores estimated from each level’s spatial and temporal cues are not only used to weigh the two modal features but also to adaptively integrate the coarse spatial and temporal saliency predictions into a guidance map, which further enhances the two modal features. Secondly, we deploy the dual-branch-based multi-level feature aggregation (DMFA) module to integrate the multi-level spatiotemporal features, where the two branches, namely the progressive decoder branch and the direct concatenation branch, sufficiently explore the cooperation of multi-level spatiotemporal features. In particular, to provide an adaptive fusion of the outputs of the two branches, we design the dual-branch fusion (DF) unit, where the channel weight of each output is learned jointly from the two outputs. The experiments conducted on four video datasets clearly demonstrate the effectiveness and superiority of our model against state-of-the-art video saliency models.
Show Figures

Figure 1. The architecture of the proposed video saliency model: the inputs are the current frame image I and its optical flow image M. First, the encoder network generates the appearance features {F_i^A} (i = 1, …, 5) and the motion features {F_i^M} (i = 1, …, 5). Then, the two modal features are combined by the quality-driven multi-modal feature fusion (QMFF) module, yielding the multi-level spatiotemporal deep features {F_i^ST} (i = 1, …, 5). Next, the inter-level feature interaction (IFI) unit is applied to each level’s spatiotemporal feature to obtain the enhanced spatiotemporal features {F_i} (i = 1, …, 5). After that, the dual-branch-based multi-level feature aggregation (DMFA) module integrates the spatiotemporal features, yielding the deep decoding features {F_i^D} (i = 1, …, 5) and the concatenation feature F_C*. Finally, the dual-branch fusion (DF) unit integrates F_C* and F_1^D, generating the final high-quality saliency map S. Here, l_3 is the supervision.
Figure 2. Detailed structure of the quality-driven multi-modal feature fusion (QMFF) module. Here, P_i^A and P_i^M refer to the coarse saliency predictions, Q_i^A and Q_i^M denote the quality scores, and P_i is the guidance map. Moreover, F_i^A and F_i^M are the i-th level deep features, F_i^Aq and F_i^Mq refer to the weighted features, F_i^Ar and F_i^Mr are the refined features, and F_i^ST is the spatiotemporal deep feature.
Figure 3. Illustration of the dual-branch fusion (DF) unit. Here, F_1^D is the output of the progressive decoder branch, F_C* is the fused feature, FC(·) is the fully connected layer, and w is the feature weight, which is divided into two sub-feature weights w_C and w_D. S is the final saliency map, split_w(·) is the split operation, and l_i (i = 1, 2, 3) is the supervision.
Figure 4. Visualization of the dual-branch fusion (DF) unit. (a) Input frame, (b) ground truth, (c) saliency prediction of l_3, (d) feature map of F_1^D, (e) saliency prediction of l_1, (f) feature map of F_C*, (g) saliency prediction of l_2.
Figure 5. (Better viewed in color) Quantitative comparison results of different video saliency models: (a) PR curves and (b) F-measure curves on the DAVIS dataset; (c) PR curves and (d) F-measure curves on the DAVSOD dataset; (e) PR curves and (f) F-measure curves on the ViSal dataset; (g) PR curves and (h) F-measure curves on the SegV2 dataset.
Figure 6. Qualitative comparison results of different video saliency models on several challenging videos of the DAVIS dataset. (a) Video frames, (b) GT, (c) OUR, (d) CAG-DDE, (e) DCFNet, (f) STFA, (g) GTNet, (h) MAGCN, (i) PCSA, (j) MGA, (k) SSAV, (l) PDB, (m) MBNM, (n) FGRNE, (o) SCOM, (p) SCNN, (q) SFLR, (r) STBP, and (s) SGSP.
Figure 7. Qualitative comparison results of different video saliency models on several challenging videos of the DAVSOD dataset. (a) Video frames, (b) GT, (c) OUR, (d) CAG-DDE, (e) DCFNet, (f) STFA, (g) GTNet, (h) PCSA, (i) MGA, (j) SSAV, (k) MBNM, (l) FGRNE, (m) SCOM, (n) SCNN, and (o) STBP.
Figure 8. Qualitative comparisons of four variants of our model. (a) Input frames, (b) GT, (c) w/o QMFF-qf, (d) w/o QMFF-f, (e) w/o QMFF-qp, (f) QMFF-q, and (g) OUR.
Figure 9. Qualitative comparisons of a variant of our model. (a) Input frames, (b) GT, (c) w/o IFI, and (d) OUR.
Figure 10. Qualitative comparisons of four variants of our model. (a) Input frames, (b) GT, (c) w/o db1, (d) w/o db2, (e) w/o DF, (f) w BiFPN, and (g) OUR.
Figure 11. Qualitative comparisons of a variant of our model. (a) Input frames, (b) GT, (c) w lw, and (d) OUR.
Figure 12. Some failure examples of our model. (a) Input frames, (b) ground truth, and (c) saliency maps generated by our model.
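QMFF weighs the appearance and motion streams by per-level quality scores estimated from their coarse saliency predictions. The block below is a simplified, assumed realization of that weighting for a single level: each stream predicts a coarse map, a scalar quality score gates its features, and the gated features are summed. The real module also performs refinement steps not shown here.

```python
# Simplified sketch of quality-driven fusion of appearance and motion features (one level).
import torch
import torch.nn as nn

class QualityDrivenFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.pred_a = nn.Conv2d(channels, 1, 1)    # coarse saliency from appearance stream
        self.pred_m = nn.Conv2d(channels, 1, 1)    # coarse saliency from motion stream
        self.quality = nn.Sequential(              # scalar quality score per stream
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(1, 1), nn.Sigmoid())

    def forward(self, feat_a, feat_m):
        p_a, p_m = self.pred_a(feat_a), self.pred_m(feat_m)
        q_a = self.quality(p_a).view(-1, 1, 1, 1)            # appearance quality score
        q_m = self.quality(p_m).view(-1, 1, 1, 1)            # motion quality score
        fused = q_a * feat_a + q_m * feat_m                   # quality-weighted spatiotemporal feature
        guidance = torch.sigmoid(q_a * p_a + q_m * p_m)       # coarse guidance map
        return fused, guidance

fa, fm = torch.randn(2, 64, 56, 56), torch.randn(2, 64, 56, 56)
fused, guide = QualityDrivenFusion(64)(fa, fm)
print(fused.shape, guide.shape)
```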
18 pages, 3040 KiB  
Article
OTNet: A Small Object Detection Algorithm for Video Inspired by Avian Visual System
by Pingge Hu, Xingtong Wang, Xiaoteng Zhang, Yueyang Cang and Li Shi
Mathematics 2022, 10(21), 4125; https://doi.org/10.3390/math10214125 - 4 Nov 2022
Cited by 2 | Viewed by 2510
Abstract
Small object detection is one of the most challenging and non-negligible fields in computer vision. Inspired by the location–focus–identification process of the avian visual system, we present a location-focused small-object-detection algorithm for video or image sequences, OTNet. The model contains three modules corresponding to the forms of saliency that drive the strongest responses of the OT, which together compute the saliency map. The three modules are responsible for temporal–spatial feature extraction, spatial feature extraction, and memory matching, respectively. We tested our model on the AU-AIR dataset and achieved up to a 97.95% recall rate, an 85.73% precision rate, and an 89.94 F1 score with lower computational complexity. Our model can also serve as a plugin module for other object detection models to improve their performance on bird-view images, especially for detecting smaller objects, improving detection performance by up to 40.01%. The results show that our model performs well on common detection metrics while simulating the visual information processing for object localization in the avian brain.
(This article belongs to the Special Issue Mathematical Method and Application of Machine Learning)
Show Figures

Figure 1. The transverse section of the midbrain showing the OT [31].
Figure 2. The structure of our algorithm.
Figure 3. The detailed structures of the models. (a) The overall structure of the model. (b) The difference between OTNet and OTNet-Lite.
Figure 4. The change in precision and recall rate over the training process. (a) Precision rate. (b) Recall rate.
Figure 5. The structure of our algorithm. (a) The ground truth. (b) The KLT result. (c) The LK result.
Figure 6. The localization results of OTNet in different scenes. (a) Multiple objects. (b) Truncation. (c) Bird view. (d) Tiny objects. (e) Occlusion. (f) Low contrast.
Figure 7. The classification results of OTNet-C. The larger boxes and the smaller boxes represent different categories. (a) Result for the parking lot scene. (b) Result for the circular road scene.
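OTNet’s first module extracts temporal–spatial saliency that responds strongly to small moving objects. The snippet below is a deliberately simple classical stand-in for that idea, not OTNet itself: a three-frame difference map thresholded into a motion saliency mask.

```python
# Classical stand-in for temporal motion saliency: three-frame differencing (not OTNet).
import numpy as np

def three_frame_saliency(f_prev, f_curr, f_next, thresh=25):
    d1 = np.abs(f_curr.astype(np.int16) - f_prev.astype(np.int16))
    d2 = np.abs(f_next.astype(np.int16) - f_curr.astype(np.int16))
    motion = np.minimum(d1, d2)                  # suppresses ghosting from either pair alone
    return (motion > thresh).astype(np.uint8)    # binary motion-saliency mask

a, b, c = (np.random.randint(0, 256, (90, 120), dtype=np.uint8) for _ in range(3))
mask = three_frame_saliency(a, b, c)
print(mask.shape, mask.sum())                    # mask size and number of "salient" pixels
```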
24 pages, 3279 KiB  
Article
Saliency-Enabled Coding Unit Partitioning and Quantization Control for Versatile Video Coding
by Wei Li, Xiantao Jiang, Jiayuan Jin, Tian Song and Fei Richard Yu
Information 2022, 13(8), 394; https://doi.org/10.3390/info13080394 - 19 Aug 2022
Cited by 5 | Viewed by 2335
Abstract
The latest video coding standard, versatile video coding (VVC), has greatly improved coding efficiency over its predecessor, high efficiency video coding (HEVC), but at the expense of sharply increased complexity. In the context of perceptual video coding (PVC), visual saliency models that exploit the characteristics of the human visual system to improve coding efficiency have become a reliable approach thanks to advances in computer performance and visual algorithms. In this paper, a novel VVC-compliant PVC optimization framework is proposed, which consists of a fast coding unit (CU) partition algorithm and a quantization control algorithm. Firstly, based on the visual saliency model, we propose a fast CU division scheme, including the re-determination of the CU division depth by calculating the Scharr operator and variance, as well as the decision on whether to apply intra sub-partitions (ISP), to reduce coding complexity. Secondly, a quantization control algorithm is proposed that adjusts the quantization parameter based on a multi-level classification of saliency values at the CU level to reduce the bitrate. In comparison with the reference model, experimental results indicate that the proposed method reduces computational complexity by about 47.19% and achieves an average bitrate saving of 3.68%. Meanwhile, the proposed algorithm incurs reasonable peak signal-to-noise ratio losses and nearly the same subjective perceptual quality.
(This article belongs to the Special Issue Signal Processing Based on Convolutional Neural Network)
Show Figures

Figure 1. Five division modes of QTMT and two examples of ISP.
Figure 2. Traditional video coding framework and the proposed perceptual video coding framework.
Figure 3. Illustration of the network for static saliency detection.
Figure 4. Illustration of our network for dynamic saliency detection.
Figure 5. VS maps of the Basketball and Kimono sequences.
Figure 6. Flowchart of the fast CU partitioning algorithm.
Figure 7. Training results of the two thresholds concerning both BDBR and time saving.
Figure 8. Mapping relationship between the saliency range and the complexity grade.
Figure 9. RD performance of the proposed method. (a) RD of “PartyScene”. (b) RD of “FourPeople”.
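The fast CU partition step re-decides the split depth from simple texture measures, namely a Scharr gradient magnitude and the block variance, compared against trained thresholds. The snippet below sketches that decision for one CU; the thresholds and the decision rule are placeholders, not the trained values from the paper.

```python
# Sketch of a texture-based CU split decision using Scharr gradients and variance.
# Thresholds are illustrative placeholders, not the paper's trained values.
import cv2
import numpy as np

def should_split_cu(cu_block, grad_thresh=40.0, var_thresh=300.0):
    gx = cv2.Scharr(cu_block, cv2.CV_64F, 1, 0)
    gy = cv2.Scharr(cu_block, cv2.CV_64F, 0, 1)
    grad_mag = float(np.mean(np.sqrt(gx ** 2 + gy ** 2)))   # texture/edge strength
    variance = float(cu_block.astype(np.float64).var())
    # Homogeneous, low-gradient CUs stay unsplit; complex ones are split further.
    return grad_mag > grad_thresh or variance > var_thresh

cu = np.random.randint(0, 256, (32, 32), dtype=np.uint8)
print(should_split_cu(cu))
```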
30 pages, 12327 KiB  
Article
ShadowDeNet: A Moving Target Shadow Detection Network for Video SAR
by Jinyu Bao, Xiaoling Zhang, Tianwen Zhang and Xiaowo Xu
Remote Sens. 2022, 14(2), 320; https://doi.org/10.3390/rs14020320 - 11 Jan 2022
Cited by 14 | Viewed by 3307
Abstract
Most existing SAR moving target shadow detectors not only tend to generate missed detections because of their limited feature extraction capacity in complex scenes, but also tend to produce numerous false alarms due to their poor foreground–background discrimination capacity. To solve these problems, this paper proposes a novel deep learning network called “ShadowDeNet” for better shadow detection of moving ground targets in video synthetic aperture radar (SAR) images. It relies on five major tools to guarantee its superior detection performance: (1) histogram equalization shadow enhancement (HESE) for enhancing shadow saliency to facilitate feature extraction, (2) a transformer self-attention mechanism (TSAM) for focusing on regions of interest to suppress clutter interference, (3) shape deformation adaptive learning (SDAL) for learning the deformed shadows of moving targets to cope with motion speed variations, (4) semantic-guided anchor-adaptive learning (SGAAL) for generating optimized anchors to match shadow location and shape, and (5) online hard-example mining (OHEM) for selecting typical difficult negative samples to improve background discrimination capacity. We conduct extensive ablation studies to confirm the effectiveness of each of the above contributions. We perform experiments on the public Sandia National Laboratories (SNL) video SAR data. Experimental results reveal the state-of-the-art performance of ShadowDeNet, with a best f1 accuracy of 66.01%, in contrast to the other five competitive methods. Specifically, ShadowDeNet is superior to the experimental baseline Faster R-CNN by 9.00% in f1 accuracy, and superior to the existing first-best model by 4.96% in f1 accuracy. Furthermore, ShadowDeNet merely sacrifices a slight detection speed within an acceptable range.
(This article belongs to the Special Issue Artificial Intelligence-Based Learning Approaches for Remote Sensing)
Show Figures

Figure 1. Relative positions between the targets and the corresponding shadows. This video SAR image is the 731st frame in the SNL data.
Figure 2. Shadow detection framework of ShadowDeNet. HESE denotes histogram equalization shadow enhancement. TSAM denotes the transformer self-attention mechanism. SDAL denotes shape deformation adaptive learning. SGAAL denotes semantic-guided anchor-adaptive learning. OHEM denotes online hard-example mining. In ShadowDeNet, without loss of generality, we select the commonly used ResNet-50 [50] as the backbone network.
Figure 3. A video SAR image. (a) The raw video SAR image; (b) the corresponding shadow ground truths. Here, different vehicles are marked in boxes with different colors and numbers for intuitive visual observation. This video SAR image is the 50th frame in the SNL data.
Figure 4. Image pixel histogram before and after HESE.
Figure 5. Moving target shadow before and after HESE. (a) Before HESE; (b) after HESE. The raw video SAR image is shown in Figure 3a.
Figure 6. More results of the histogram equalization shadow enhancement (HESE). (a) Before HESE; (b) after HESE. Different vehicles are marked in boxes with different colors and numbers for intuitive visual observation. #N denotes the N-th frame. The white arrows indicate the moving direction.
Figure 7. Residual block in the backbone network. (a) The raw residual block in ResNet-50; (b) the improved residual block with TSAM.
Figure 8. Detailed implementation process of TSAM.
Figure 9. Moving target shadow deformation with the change of moving speed. From left to right (#64 → #74 → #84 → #94), the speed becomes smaller and smaller. The blue arrow indicates the moving direction.
Figure 10. Different convolutions. (a) Classical convolution; (b) deformation convolution.
Figure 11. Detailed implementation process of SDAL.
Figure 12. Sketch map of different anchor distributions. (a) The raw distribution; (b) the improved distribution with SGAAL. Anchors are marked in blue boxes.
Figure 13. Detailed implementation process of SGAAL.
Figure 14. Detailed implementation process of OHEM.
Figure 15. Experimental working environment of the SNL video SAR data at the Kirtland Air Force Base Eubank Gate. (a) The optical image; (b) the corresponding SAR image.
Figure 16. Accuracy curves with different IOU thresholds for different methods. (a) The curve between recall (r) and IOU; (b) the curve between precision (p) and IOU; (c) the curve between average precision (ap) and IOU; (d) the curve between f1 and IOU.
Figure 17. Precision–recall (p–r) curves of different methods.
Figure 18. Qualitative video SAR moving target shadow detection results of different methods. (a) Ground truth; (b) Faster R-CNN; (c) FPN; (d) YOLOv3; (e) RetinaNet; (f) CenterNet; (g) ShadowDeNet. The false alarms are marked by orange boxes. The missed detections are marked by red ellipses. Apart from CenterNet, the numbers above the boxes are the confidences. The numbers above the boxes in (f) denote CenterNet’s Gaussian heatmap probabilities of the top five keypoints. The IOU threshold is 0.50, the same as the PASCAL VOC criterion [79].
Figure 19. Different histogram equalization shadow enhancements. (a) Shadows in the raw video SAR image; (b) shadows enhanced by HESE; (c) shadows enhanced by AHESE.
Figure 20. Qualitative video SAR moving target shadow detection results of ShadowDeNet on the CASIC 23 research institute data. The ground truths are marked by green boxes. The numbers above the boxes are the confidences. The IOU threshold is 0.50, the same as the PASCAL VOC criterion [79].
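HESE stretches the SAR image histogram so that dark target shadows become more separable from the clutter before feature extraction. The snippet below shows plain global histogram equalization as an assumed approximation of that preprocessing step; the paper’s exact variant (and the AHESE alternative in Figure 19) may differ.

```python
# Assumed approximation of histogram-equalization shadow enhancement on a SAR frame.
import cv2
import numpy as np

def enhance_shadows(sar_frame_gray):
    # Global histogram equalization spreads the intensity range,
    # which increases contrast and makes low-intensity shadows stand out.
    return cv2.equalizeHist(sar_frame_gray)

frame = np.random.randint(0, 256, (256, 256), dtype=np.uint8)  # stand-in for a video SAR frame
enhanced = enhance_shadows(frame)
print(frame.std(), enhanced.std())   # equalization typically increases global contrast
```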
19 pages, 4031 KiB  
Article
Saliency Detection with Moving Camera via Background Model Completion
by Yu-Pei Zhang and Kwok-Leung Chan
Sensors 2021, 21(24), 8374; https://doi.org/10.3390/s21248374 - 15 Dec 2021
Cited by 2 | Viewed by 2593
Abstract
Detecting saliency in videos is a fundamental step in many computer vision systems. Saliency is the significant target(s) in the video, and the object of interest is further analyzed for high-level applications. Saliency and the background can be segregated if they exhibit different visual cues, so saliency detection is often formulated as background subtraction. However, saliency detection is challenging. For instance, a dynamic background can result in false positive errors, while camouflage will result in false negative errors. With moving cameras, the captured scenes are even more complicated to handle. We propose a new framework, called saliency detection via background model completion (SD-BMC), that comprises a background modeler and a deep learning background/foreground segmentation network. The background modeler generates an initial clean background image from a short image sequence. Based on the idea of video completion, a good background frame can be synthesized even with the co-existence of a changing background and moving objects. We adopt a background/foreground segmenter that was pre-trained on a specific video dataset; it can also detect saliency in unseen videos. The background modeler can adjust the background image dynamically when the output of the background/foreground segmenter deteriorates while processing a long video. To the best of our knowledge, our framework is the first to adopt video completion for background modeling and saliency detection in videos captured by moving cameras. The F-measure results obtained on the pan-tilt-zoom (PTZ) videos show that our proposed framework outperforms some deep learning-based background subtraction models by 11% or more. On more challenging videos, our framework also outperforms many high-ranking background subtraction methods by more than 3%.
(This article belongs to the Section Sensing and Imaging)
Show Figures

Figure 1
<p>Overview of the saliency detection framework: (<b>a</b>) background model initialization; (<b>b</b>) continuous saliency detection.</p>
Figure 2
<p>Video completion-based background modeler.</p>
Figure 3
<p>Visual results of background modeling: (<b>a</b>) original frame; (<b>b</b>) 30 initialization frames; and (<b>c</b>) 100 initialization frames.</p>
Figure 4
<p>Structure of foreground segmenter.</p>
Figure 5
<p>Visual results of BSUV-Net 2.0 and SD-BMC on CDNet 2014.</p>
Figure 6
<p>Visual results on customized dataset.</p>
Figure 7
<p>Comparison of background frames used in BSUV-Net 2.0, PAWCS, SuBSENSE, and SD-BMC on PTZ video.</p>
14 pages, 2603 KiB  
Article
B-Line Detection and Localization in Lung Ultrasound Videos Using Spatiotemporal Attention
by Hamideh Kerdegari, Nhat Tran Huy Phung, Angela McBride, Luigi Pisani, Hao Van Nguyen, Thuy Bich Duong, Reza Razavi, Louise Thwaites, Sophie Yacoub, Alberto Gomez and VITAL Consortium
Appl. Sci. 2021, 11(24), 11697; https://doi.org/10.3390/app112411697 - 9 Dec 2021
Cited by 13 | Viewed by 3535
Abstract
The presence of B-line artefacts, the main artefact reflecting lung abnormalities in dengue patients, is often assessed using lung ultrasound (LUS) imaging. Inspired by human visual attention that enables us to process videos efficiently by paying attention to where and when it is required, we propose a spatiotemporal attention mechanism for B-line detection in LUS videos. The spatial attention allows the model to focus on the most task relevant parts of the image by learning a saliency map. The temporal attention generates an attention score for each attended frame to identify the most relevant frames from an input video. Our model not only identifies videos where B-lines show, but also localizes, within those videos, B-line related features both spatially and temporally, despite being trained in a weakly-supervised manner. We evaluate our approach on a LUS video dataset collected from severe dengue patients in a resource-limited hospital, assessing the B-line detection rate and the model’s ability to localize discriminative B-line regions spatially and B-line frames temporally. Experimental results demonstrate the efficacy of our approach for classifying B-line videos with an F1 score of up to 83.2% and localizing the most salient B-line regions both spatially and temporally with a correlation coefficient of 0.67 and an IoU of 69.7%, respectively. Full article
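A minimal PyTorch-style sketch of the kind of spatiotemporal attention pipeline this abstract describes is given below: per-frame CNN features are reweighted by a learned spatial mask, pooled, passed through a bidirectional LSTM, and pooled over time with softmax temporal attention scores. Layer sizes, channel counts, and module names are illustrative assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalAttentionClassifier(nn.Module):
    def __init__(self, feat_ch=32, hidden=64, num_classes=2):
        super().__init__()
        # Per-frame spatial feature extractor (stand-in CNN).
        self.backbone = nn.Sequential(
            nn.Conv2d(1, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Spatial attention: a 1-channel mask in [0, 1] over the feature map.
        self.mask_head = nn.Conv2d(feat_ch, 1, 1)
        self.rnn = nn.LSTM(feat_ch, hidden, batch_first=True, bidirectional=True)
        # Temporal attention: one score per frame, softmax-normalized.
        self.temporal_score = nn.Linear(2 * hidden, 1)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, clip):                        # clip: (B, T, 1, H, W)
        b, t = clip.shape[:2]
        x = self.backbone(clip.flatten(0, 1))       # (B*T, C, h, w)
        mask = torch.sigmoid(self.mask_head(x))     # (B*T, 1, h, w) spatial attention
        x = x * mask                                # element-wise reweighting
        x = F.adaptive_avg_pool2d(x, 1).flatten(1)  # global average pooling -> (B*T, C)
        seq, _ = self.rnn(x.reshape(b, t, -1))      # (B, T, 2*hidden)
        alpha = F.softmax(self.temporal_score(seq), dim=1)  # (B, T, 1) frame weights
        pooled = (alpha * seq).sum(dim=1)           # attention-weighted video feature
        return self.classifier(pooled), mask.reshape(b, t, *mask.shape[1:]), alpha

model = SpatioTemporalAttentionClassifier()
logits, spatial_maps, frame_weights = model(torch.randn(2, 8, 1, 64, 64))
print(logits.shape, spatial_maps.shape, frame_weights.shape)
```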
(This article belongs to the Special Issue Computational Ultrasound Imaging and Applications)
Show Figures

Figure 1
<p>Sample LUS images. (<b>Left</b>): a healthy lung containing several A-line artefacts; (<b>Right</b>): a dengue patient’s lung showing a B-line artefact as a result of fluid leakage into the lung.</p>
Figure 2
<p>The proposed architecture for LUS B-line detection and spatiotemporal localization. This model consists of a spatial feature extraction module (CNN layers), followed by a spatial attention network, then a bidirectional LSTM, and a temporal attention module. The parameters of each layer and module are detailed in the text.</p>
Figure 3
<p>The CNN architecture of the proposed model. It consists of four convolution layers with ReLU activation functions; max pooling follows the second and fourth convolution layers.</p>
Figure 4
<p>Spatial attention module. Several layers of convolutional networks (for details see <a href="#applsci-11-11697-t001" class="html-table">Table 1</a>) are used to learn the importance mask <math display="inline"><semantics> <msub> <mi>M</mi> <mi>i</mi> </msub> </semantics></math> for the input image feature <math display="inline"><semantics> <msub> <mi>X</mi> <mi>i</mi> </msub> </semantics></math>, the output is the element-wise multiplication <math display="inline"><semantics> <mrow> <mover accent="true"> <msub> <mi>X</mi> <mi>i</mi> </msub> <mo>˜</mo> </mover> <mo>=</mo> <msub> <mi>X</mi> <mi>i</mi> </msub> <mo>⊙</mo> <msub> <mi>M</mi> <mi>i</mi> </msub> </mrow> </semantics></math>. GAP: Global Average Pooling.</p>
Figure 5
<p>Examples of B-line regions annotation used for spatial attention task. A straight yellow line was drawn on the B-line region extending from the surface of the lung distally following the direction of propagation of the sound waves.</p>
Figure 6
<p>Examples of spatial attention map for B-line localization task. Our spatial attention module can automatically highlight B-line regions (red areas) and avoid irrelevant regions corresponding to no-B-line regions or background. Yellow straight lines represent ground truth. Correlation coefficient values (<span class="html-italic">r</span>) are presented at the bottom of each attention map.</p>
Figure 7
<p><b>Top</b>: An example of polar coordinates applied to a sample B-line frame; the red cross shows the beam source. <b>Center</b>: The generated 1-dimensional diagram showing its related ground truth (green line; the black line is a normal distribution). <b>Bottom</b>: attention map values (red line) across the coordinates. In this example, the correlation coefficient value is <span class="html-italic">r</span> = 0.71.</p>
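The caption above compares a 1-D attention profile against a Gaussian-shaped ground-truth profile along the polar coordinate using the correlation coefficient r. A minimal sketch of that comparison follows; the profile length, the Gaussian parameters, and the numpy-based Pearson computation are illustrative assumptions.

```python
import numpy as np

def pearson_r(a, b):
    # Pearson correlation coefficient between two 1-D profiles.
    return float(np.corrcoef(a, b)[0, 1])

angles = np.linspace(0.0, 1.0, 128)                         # normalized polar coordinate
ground_truth = np.exp(-0.5 * ((angles - 0.40) / 0.05) ** 2) # Gaussian around the annotated B-line
attention = np.exp(-0.5 * ((angles - 0.42) / 0.07) ** 2)    # attention profile (slightly offset)
print(f"r = {pearson_r(attention, ground_truth):.2f}")
```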
Figure 8
<p>Generated temporal (horizontal axis) and spatial (heatmap overlaid onto B-Mode B-line frames) attentions estimated by our model on an example of an LUS video that includes both B-line and non-B-line frames. The top graph shows the temporal attention weights (in blue) and the corresponding ground truth annotations (in green). Spatial attention maps are visualized for B-line frames (for example, frames 16 and 22): the yellow lines show the manual B-line annotations, and the correlation coefficient values (<span class="html-italic">r</span>), computed as described in the text, are presented at the bottom of each frame for illustration.</p>
29 pages, 1759 KiB  
Article
Multimodal Emotion Recognition on RAVDESS Dataset Using Transfer Learning
by Cristina Luna-Jiménez, David Griol, Zoraida Callejas, Ricardo Kleinlein, Juan M. Montero and Fernando Fernández-Martínez
Sensors 2021, 21(22), 7665; https://doi.org/10.3390/s21227665 - 18 Nov 2021
Cited by 76 | Viewed by 10998
Abstract
Emotion Recognition is attracting the attention of the research community due to the multiple areas where it can be applied, such as in healthcare or in road safety systems. In this paper, we propose a multimodal emotion recognition system that relies on speech and facial information. For the speech-based modality, we evaluated several transfer-learning techniques, more specifically, embedding extraction and Fine-Tuning. The best accuracy results were achieved when we fine-tuned the CNN-14 of the PANNs framework, confirming that the training was more robust when it did not start from scratch and the tasks were similar. Regarding the facial emotion recognizers, we propose a framework that consists of a pre-trained Spatial Transformer Network on saliency maps and facial images followed by a bi-LSTM with an attention mechanism. The error analysis reported that the frame-based systems could present some problems when they were used directly to solve a video-based task despite the domain adaptation, which opens a new line of research to discover new ways to correct this mismatch and take advantage of the embedded knowledge of these pre-trained models. Finally, from the combination of these two modalities with a late fusion strategy, we achieved 80.08% accuracy on the RAVDESS dataset on a subject-wise 5-CV evaluation, classifying eight emotions. The results revealed that these modalities carry relevant information to detect users’ emotional state and their combination enables improvement of system performance. Full article
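As a rough sketch of the late-fusion idea in this abstract, the example below concatenates per-modality class scores (speech and face), trains a LinearSVC on the fused vector, and evaluates with subject-wise cross-validation. The synthetic scores, the GroupKFold setup, and the hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_videos, n_classes = 200, 8
# Stand-ins for per-modality class scores (e.g., softmax outputs) per video.
speech_scores = rng.random((n_videos, n_classes))
face_scores = rng.random((n_videos, n_classes))
labels = rng.integers(0, n_classes, n_videos)
subjects = rng.integers(0, 24, n_videos)           # actor/subject id per video

fused = np.hstack([speech_scores, face_scores])    # late fusion: concatenate modality scores
accs = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(fused, labels, groups=subjects):
    clf = LinearSVC(C=1.0, max_iter=10000)
    clf.fit(fused[train_idx], labels[train_idx])
    accs.append(clf.score(fused[test_idx], labels[test_idx]))
print(f"subject-wise 5-CV accuracy: {np.mean(accs):.3f}")
```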
(This article belongs to the Special Issue Multimodal Emotion Recognition in Artificial Intelligence)
Show Figures

Figure 1
<p>Block diagram of the implemented systems.</p>
Figure 2
<p>Proposed pipelines for speech emotion recognition.</p>
Figure 3
<p>Spatial Transformer CNN architecture with visual saliency-based masks.</p>
Figure 4
<p>Bidirectional-LSTM with attention mechanism for facial emotion recognition at the video level. Modified version from source [<a href="#B77-sensors-21-07665" class="html-bibr">77</a>].</p>
Figure 5
<p>Average confusion matrix of the fine-tuned CNN-14 experiment with an accuracy of 76.58%.</p>
Figure 6
<p>Average confusion matrix of the bi-LSTM with two layers of 300 neurons and two attention layers trained with the embeddings extracted from the flattened-810 layer of the fine-tuned STN. Accuracy of 57.08%. See <a href="#sensors-21-07665-t003" class="html-table">Table 3</a>.</p>
Figure 7
<p>The top average accuracy of the 5-CV obtained for speech and visual modalities with a 95% confidence interval. In orange, the experiments with the original videos; in blue, the samples with speech; in green, the mix of the top modalities: the speech model without VAD and the visual model with VAD.</p>
Figure 8
<p>Average confusion matrix of the top late fusion strategy using a LinearSVC combining the top results of SER for the version without VAD and the FER for the version with VAD. Accuracy of 80.08%. See <a href="#sensors-21-07665-t0A3" class="html-table">Table A3</a>.</p>
Figure A1
<p>Example of frames from a video tagged as ‘Calm’ with some samples predicted as ‘Happy’. The whole video was correctly predicted as ‘Calm’.</p>
Figure A2
<p>Example of frames from a video tagged as ‘Surprised’ incorrectly predicted as ‘Happy’. The whole video was incorrectly predicted as ‘Happy’.</p>
Figure A3
<p>Example of frames from a video tagged as ‘Sad’ incorrectly predicted as ‘Fearful’. The whole video was incorrectly predicted as ‘Fearful’.</p>
18 pages, 9636 KiB  
Article
Video Desnowing and Deraining via Saliency and Dual Adaptive Spatiotemporal Filtering
by Yongji Li, Rui Wu, Zhenhong Jia, Jie Yang and Nikola Kasabov
Sensors 2021, 21(22), 7610; https://doi.org/10.3390/s21227610 - 16 Nov 2021
Cited by 5 | Viewed by 2300
Abstract
Outdoor vision sensing systems often struggle with poor weather conditions, such as snow and rain, which pose a great challenge to existing video desnowing and deraining methods. In this paper, we propose a novel video desnowing and deraining model that utilizes the saliency information of moving objects to address this problem. First, we remove the snow and rain from the video by low-rank tensor decomposition, which makes full use of the spatial location information and the correlation between the three channels of the color video. Second, because existing algorithms often regard sparse snowflakes and rain streaks as moving objects, this paper injects saliency information into moving object detection, which reduces false alarms and missed detections of moving objects. At the same time, feature point matching is used to mine the redundant information of moving objects in consecutive frames, and we propose a dual adaptive minimum filtering algorithm in the spatiotemporal domain to remove snow and rain in front of moving objects. Both qualitative and quantitative experimental results show that the proposed algorithm is more competitive than other state-of-the-art snow and rain removal methods. Full article
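A minimal sketch of the low-rank background idea in this abstract is shown below: frames are stacked as columns of a matrix and a truncated SVD yields a low-rank background, while the residual collects snowflakes, rain streaks, and moving objects. The rank-1 truncation, the single-channel toy data, and the matrix (rather than tensor) formulation are simplifications assumed for illustration; the paper's actual method uses low-rank tensor decomposition over the three color channels plus the saliency-guided filtering steps described above.

```python
import numpy as np

def low_rank_background(frames, rank=1):
    """Approximate a static background from a stack of frames (T, H, W)
    by a truncated SVD of the (H*W, T) matrix whose columns are frames."""
    t, h, w = frames.shape
    matrix = frames.reshape(t, h * w).T.astype(float)   # pixels x time
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank, :]  # rank-r approximation
    background = low_rank.mean(axis=1).reshape(h, w)    # collapse time
    sparse = matrix - low_rank                          # snow/rain + moving objects
    return background, sparse.T.reshape(t, h, w)

# Toy example: static ramp background corrupted by sparse bright "snow" pixels.
t, h, w = 20, 40, 60
frames = np.tile(np.linspace(0, 200, h * w).reshape(h, w), (t, 1, 1))
snow = (np.random.rand(t, h, w) < 0.01) * 255.0
background, residual = low_rank_background(frames + snow)
print(background.shape, np.abs(residual).max() > 0)
```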
(This article belongs to the Section Sensing and Imaging)
Show Figures

Figure 1
<p>The flow diagram of our proposed algorithm.</p>
Figure 2
<p>Extracting the low-rank background (<b>b</b>) from a snow video sequence (<b>a</b>).</p>
Figure 3
<p>(<b>a</b>) The moving object matching process; (<b>b</b>) the result of dual adaptive spatiotemporal filtering; (<b>c</b>) the clean video frame obtained by pasting the desnowed moving object back into the low-rank background.</p>
Figure 4
<p>Comparison on a synthetic snow video. (<b>a</b>) Ground truth, (<b>b</b>) input, (<b>c</b>) Kim et al. [<a href="#B16-sensors-21-07610" class="html-bibr">16</a>], (<b>d</b>) Wang et al. [<a href="#B8-sensors-21-07610" class="html-bibr">8</a>], (<b>e</b>) Li et al. [<a href="#B14-sensors-21-07610" class="html-bibr">14</a>], (<b>f</b>) Chen et al. [<a href="#B32-sensors-21-07610" class="html-bibr">32</a>], (<b>g</b>) proposed method.</p>
Figure 5
<p>Comparison on a synthetic rain video. (<b>a</b>) Ground truth, (<b>b</b>) input, (<b>c</b>) Kim et al. [<a href="#B16-sensors-21-07610" class="html-bibr">16</a>], (<b>d</b>) Wang et al. [<a href="#B8-sensors-21-07610" class="html-bibr">8</a>], (<b>e</b>) Li et al. [<a href="#B14-sensors-21-07610" class="html-bibr">14</a>], (<b>f</b>) Chen et al. [<a href="#B32-sensors-21-07610" class="html-bibr">32</a>], (<b>g</b>) proposed method.</p>
Figure 6
<p>Comparison on a real snow video. (<b>a</b>) Input, (<b>b</b>) Kim et al. [<a href="#B16-sensors-21-07610" class="html-bibr">16</a>], (<b>c</b>) Wang et al. [<a href="#B8-sensors-21-07610" class="html-bibr">8</a>], (<b>d</b>) Li et al. [<a href="#B14-sensors-21-07610" class="html-bibr">14</a>], (<b>e</b>) Chen et al. [<a href="#B32-sensors-21-07610" class="html-bibr">32</a>], (<b>f</b>) proposed method.</p>
Figure 7
<p>Comparison on a real rain video. (<b>a</b>) Input, (<b>b</b>) Kim et al. [<a href="#B16-sensors-21-07610" class="html-bibr">16</a>], (<b>c</b>) Wang et al. [<a href="#B8-sensors-21-07610" class="html-bibr">8</a>], (<b>d</b>) Li et al. [<a href="#B14-sensors-21-07610" class="html-bibr">14</a>], (<b>e</b>) Chen et al. [<a href="#B32-sensors-21-07610" class="html-bibr">32</a>], (<b>f</b>) proposed method.</p>
Figure 8
<p>Comparison on a real snow video. (<b>a</b>) Input, (<b>b</b>) Kim et al. [<a href="#B16-sensors-21-07610" class="html-bibr">16</a>], (<b>c</b>) Wang et al. [<a href="#B8-sensors-21-07610" class="html-bibr">8</a>], (<b>d</b>) Li et al. [<a href="#B14-sensors-21-07610" class="html-bibr">14</a>], (<b>e</b>) Chen et al. [<a href="#B32-sensors-21-07610" class="html-bibr">32</a>], (<b>f</b>) proposed method.</p>
Figure 9
<p>Comparison on a real rain video. (<b>a</b>) Input, (<b>b</b>) Kim et al. [<a href="#B16-sensors-21-07610" class="html-bibr">16</a>], (<b>c</b>) Wang et al. [<a href="#B8-sensors-21-07610" class="html-bibr">8</a>], (<b>d</b>) Li et al. [<a href="#B14-sensors-21-07610" class="html-bibr">14</a>], (<b>e</b>) Chen et al. [<a href="#B32-sensors-21-07610" class="html-bibr">32</a>], (<b>f</b>) proposed method.</p>
Figure 10
<p>Comparison on a real snow video. (<b>a</b>) Input, (<b>b</b>) Kim et al. [<a href="#B16-sensors-21-07610" class="html-bibr">16</a>], (<b>c</b>) Wang et al. [<a href="#B8-sensors-21-07610" class="html-bibr">8</a>], (<b>d</b>) Li et al. [<a href="#B14-sensors-21-07610" class="html-bibr">14</a>], (<b>e</b>) Chen et al. [<a href="#B32-sensors-21-07610" class="html-bibr">32</a>], (<b>f</b>) proposed method.</p>
Figure 11
<p>Comparison on a real rain video. (<b>a</b>) Input, (<b>b</b>) Kim et al. [<a href="#B16-sensors-21-07610" class="html-bibr">16</a>], (<b>c</b>) Wang et al. [<a href="#B8-sensors-21-07610" class="html-bibr">8</a>], (<b>d</b>) Li et al. [<a href="#B14-sensors-21-07610" class="html-bibr">14</a>], (<b>e</b>) Chen et al. [<a href="#B32-sensors-21-07610" class="html-bibr">32</a>], (<b>f</b>) proposed method.</p>
Figure 12
<p>Comparison on a real rain video. (<b>a</b>) Input, (<b>b</b>) Kim et al. [<a href="#B16-sensors-21-07610" class="html-bibr">16</a>], (<b>c</b>) Wang et al. [<a href="#B8-sensors-21-07610" class="html-bibr">8</a>], (<b>d</b>) Li et al. [<a href="#B14-sensors-21-07610" class="html-bibr">14</a>], (<b>e</b>) Chen et al. [<a href="#B32-sensors-21-07610" class="html-bibr">32</a>], (<b>f</b>) proposed method.</p>
Figure 13
<p>Runtime comparison of the compared methods on two videos. (<b>a</b>) The test video is the synthetic snow video (<a href="#sensors-21-07610-f004" class="html-fig">Figure 4</a>). (<b>b</b>) The test video is the real rain video (<a href="#sensors-21-07610-f011" class="html-fig">Figure 11</a>).</p>