Search Results (53)

Search Parameters:
Keywords = compressed video enhancement

17 pages, 4619 KiB  
Article
Efficient Video Compression Using Afterimage Representation
by Minseong Jeon and Kyungjoo Cheoi
Sensors 2024, 24(22), 7398; https://doi.org/10.3390/s24227398 - 20 Nov 2024
Viewed by 577
Abstract
Recent advancements in large-scale video data have highlighted the growing need for efficient data compression techniques to enhance video processing performance. In this paper, we propose an afterimage-based video compression method that significantly reduces video data volume while maintaining analytical performance. The proposed approach utilizes optical flow to adaptively select the number of keyframes based on scene complexity, optimizing compression efficiency. Additionally, object movement masks extracted from keyframes are accumulated over time using alpha blending to generate the final afterimage. Experiments on the UCF-Crime dataset demonstrated that the proposed method achieved a 95.97% compression ratio. In binary classification experiments on normal/abnormal behaviors, the compressed videos maintained performance comparable to the original videos, while in multi-class classification, they outperformed the originals. Notably, classification experiments focused exclusively on abnormal behaviors exhibited a significant 4.25% improvement in performance. Moreover, further experiments showed that large language models (LLMs) can interpret the temporal context of original videos from single afterimages. These findings confirm that the proposed afterimage-based compression technique effectively preserves spatiotemporal information while significantly reducing data size. Full article
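The central operation of the method, accumulating object-movement masks into a single afterimage with alpha blending, can be illustrated with a short sketch. This is a minimal illustration assuming OpenCV and Farneback optical flow; the motion threshold and blending weight are placeholder values, not the authors' parameters or their adaptive keyframe-selection logic.

```python
# Minimal sketch of an afterimage built from alpha-blended motion masks.
# Assumptions: OpenCV is available; flow_thresh and alpha are illustrative
# choices, not the values tuned in the paper.
import cv2
import numpy as np

def build_afterimage(video_path, flow_thresh=1.0, alpha=0.3):
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        raise IOError("cannot read video")
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    afterimage = prev.astype(np.float32)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense optical flow indicates how much each region is changing.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag = np.linalg.norm(flow, axis=2)
        # Keep only regions with noticeable motion (a crude object-movement mask).
        mask = (mag > flow_thresh).astype(np.float32)[..., None]
        # Accumulate the moving regions into the afterimage with alpha blending.
        afterimage = afterimage * (1 - alpha * mask) + frame.astype(np.float32) * (alpha * mask)
        prev_gray = gray

    cap.release()
    return afterimage.astype(np.uint8)
```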
Figure 1: Example of afterimages: (a) an afterimage showing behavioral interactions within a single group; (b) an afterimage depicting simultaneous behavioral interactions between two groups.
Figure 2: Workflow of the proposed afterimage generation pipeline.
Figure 3: Results of applying the proposed keyframe selection algorithm: (a) keyframe extraction from the ‘daria_jack’ video in the Weizmann dataset; (b) keyframe extraction from the ‘Burglary034_x364’ video in the UCF-Crime dataset.
Figure 4: Chronological sequence of afterimages generated from the ‘Abuse001_x264’ video. Top row: afterimages 1–4; middle row: afterimages 5–8; bottom row: afterimages 9–11.
16 pages, 21131 KiB  
Article
GCS-YOLOv8: A Lightweight Face Extractor to Assist Deepfake Detection
by Ruifang Zhang, Bohan Deng, Xiaohui Cheng and Hong Zhao
Sensors 2024, 24(21), 6781; https://doi.org/10.3390/s24216781 - 22 Oct 2024
Viewed by 837
Abstract
To address the issues of target feature blurring and increased false detections caused by high compression rates in deepfake videos, as well as the high computational resource requirements of existing face extractors, we propose a lightweight face extractor to assist deepfake detection, GCS-YOLOv8. Firstly, we employ the HGStem module for initial downsampling to address the issue of false detections of small non-face objects in deepfake videos, thereby improving detection accuracy. Secondly, we introduce the C2f-GDConv module to mitigate the low-FLOPs pitfall while reducing the model’s parameters, thereby lightening the network. Additionally, we add a new P6 large target detection layer to expand the receptive field and capture multi-scale features, solving the problem of detecting large-scale faces in low-compression deepfake videos. We also design a cross-scale feature fusion module called CCFG (CNN-based Cross-Scale Feature Fusion with GDConv), which integrates features from different scales to enhance the model’s adaptability to scale variations while reducing network parameters, addressing the high computational resource requirements of traditional face extractors. Furthermore, we improve the detection head by utilizing group normalization and shared convolution, simplifying the process of face detection while maintaining detection performance. The training dataset was also refined by removing low-accuracy and low-resolution labels, which reduced the false detection rate. Experimental results demonstrate that, compared to YOLOv8, this face extractor achieves the AP of 0.942, 0.927, and 0.812 on the WiderFace dataset’s Easy, Medium, and Hard subsets, representing improvements of 1.1%, 1.3%, and 3.7% respectively. The model’s parameters and FLOPs are only 1.68 MB and 3.5 G, reflecting reductions of 44.2% and 56.8%, making it more effective and lightweight in extracting faces from deepfake videos. Full article
(This article belongs to the Section Intelligent Sensors)
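Independently of the authors' architectural changes, the role of a face extractor in a deepfake pipeline is to crop face regions from decoded frames before classification. A rough sketch using the ultralytics YOLO API is shown below; the checkpoint path is a placeholder for any face-trained YOLOv8-style model, not the GCS-YOLOv8 weights.

```python
# Sketch of using a YOLOv8-style detector to crop faces from deepfake video
# frames before running a deepfake classifier. "face_detector.pt" is a
# placeholder for a face-trained checkpoint, not the paper's GCS-YOLOv8 model.
import cv2
from ultralytics import YOLO

model = YOLO("face_detector.pt")  # placeholder checkpoint path

cap = cv2.VideoCapture("deepfake_clip.mp4")
faces = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    results = model(frame, verbose=False)
    for x1, y1, x2, y2 in results[0].boxes.xyxy.cpu().numpy().astype(int):
        faces.append(frame[y1:y2, x1:x2])  # cropped face region for the classifier
cap.release()
```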
Figure 1: Structure of the YOLOv8.
Figure 2: Structure of the GCS-YOLOv8.
Figure 3: Structure of the HGStem.
Figure 4: Structure of the C2f-GDConv.
Figure 5: Structure of the Detect head and the GSCD.
Figure 6: Comparison of detection effects on WiderFace test sets.
Figure 7: Comparison of detection effects on Celeb-DF-v2 and FF++.
26 pages, 1960 KiB  
Article
Fast CU Partition Decision Algorithm Based on Bayesian and Texture Features
by Erlin Tian, Yifan Yang and Qiuwen Zhang
Electronics 2024, 13(20), 4082; https://doi.org/10.3390/electronics13204082 - 17 Oct 2024
Viewed by 715
Abstract
As internet speeds increase and user demands for video quality grow, video coding standards continue to evolve. H.266/Versatile Video Coding (VVC), as the new generation of video coding standards, further improves compression efficiency but also brings higher computational complexity. Despite the significant advancements VVC has made in compression ratio and video quality, the introduction of new coding techniques and complex coding unit (CU) partitioning methods has also led to increased encoding complexity. This complexity not only extends encoding time but also increases hardware resource consumption, limiting the application of VVC in real-time video processing and low-power devices. To alleviate the encoding complexity of VVC, this paper puts forward a Bayesian and texture-feature-based fast splitting algorithm for VVC intraframe coding blocks, which aims to reduce unnecessary computational steps, enhance encoding efficiency, and maintain video quality as much as possible. In the rapid coding stage, the video frames are coded by the original VVC test model (VTM), and the Joint Rough Mode Decision (JRMD) evaluation cost is used to update the parameters of the Bayesian model and set two thresholds that judge whether the current coding block should continue to be split. Then, for coding blocks larger than those satisfying the above threshold conditions, the predominant direction of the texture within the coding block is determined by calculating the standard deviations along the horizontal and vertical axes, so that unnecessary split modes for the current coding block can be skipped. The experimental results show that the proposed approach increases the bitrate by only 1.40% on average while reducing encoder execution time by 49.50%. The overall algorithm optimizes VVC intraframe coding and reduces the coding complexity of VVC.
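The texture test described above reduces to comparing how strongly pixel values vary along rows versus columns of a coding block; the encoder can then evaluate only split modes aligned with the dominant direction. A minimal sketch, assuming a NumPy array of luma samples and an illustrative dominance ratio rather than the thresholds tuned in the paper:

```python
# Sketch of the texture-direction test: compare standard deviations measured
# along rows and columns of a coding block. The ratio is an illustrative value.
import numpy as np

def dominant_texture_direction(block: np.ndarray, ratio: float = 1.5) -> str:
    std_rows = block.std(axis=1).mean()  # spread within each row: horizontal variation
    std_cols = block.std(axis=0).mean()  # spread within each column: vertical variation
    if std_rows > ratio * std_cols:
        return "horizontal"
    if std_cols > ratio * std_rows:
        return "vertical"
    return "none"  # no dominant direction: evaluate all split modes
```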
Figure 1: CTU partition map.
Figure 2: Fitting of Gaussian distribution for the JRMD number of CUs.
Figure 3: Gaussian distribution fitting for JRMD number counts and CUs.
Figure 4: Texture characteristics of 64 × 64 CUs.
Figure 5: Percentage of the prediction accuracy at different threshold values.
Figure 6: Overall algorithmic flow framework.
Figure 7: Comparison of VTM 10.0 standard encoder RD curves with proposed algorithm on each test sequence.
Figure 8: Contrasting the mean BDBR and TS values across various algorithms.
Figure 9: RD curves of overall scheme performance.
12 pages, 3068 KiB  
Article
Performance Exploration of Optical Wireless Video Communication Based on Adaptive Block Sampling Compressive Sensing
by Jinwang Li, Haifeng Yao, Keyan Dong, Yansong Song, Tianci Liu, Zhongyu Cao, Weihao Wang, Yixiang Zhang, Kunpeng Jiang and Zhi Liu
Photonics 2024, 11(10), 969; https://doi.org/10.3390/photonics11100969 - 16 Oct 2024
Viewed by 645
Abstract
Optical wireless video transmission technology combines the advantages of high data rates, enhanced security, large bandwidth capacity, and strong anti-interference capabilities inherent in optical communication, establishing it as a pivotal technology in contemporary data transmission networks. However, video data comprises a large volume of image information, resulting in substantial data flow with significant redundant bits. To address this, we propose an adaptive block sampling compressive sensing algorithm that overcomes the limitations of sampling inflexibility in traditional compressive sensing, which often leads to either redundant or insufficient local sampling. This method significantly reduces the presence of redundant bits in video images. First, the sampling mechanism of the block-based compressive sensing algorithm was optimized. Subsequently, a wireless optical video transmission experimental system was developed using a Field-Programmable Gate Array chip. Finally, experiments were conducted to evaluate the transmission of video optical signals. The results demonstrate that the proposed algorithm improves the peak signal-to-noise ratio by over 3 dB compared to other algorithms, with an enhancement exceeding 1.5 dB even in field tests, thereby significantly optimizing video transmission quality. This research contributes essential technical insights for the enhancement of wireless optical video transmission performance. Full article
(This article belongs to the Special Issue Next-Generation Free-Space Optical Communication Technologies)
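The idea of adaptive block sampling, assigning more measurements to blocks that carry more information, can be sketched as follows. Block variance stands in for the Gaussian-domain saliency maps used in the paper, and the sampling-rate bounds and normalization constant are illustrative assumptions.

```python
# Sketch of block-based compressive sensing with a per-block sampling rate.
# Saliency is approximated here by block variance for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def sample_blocks(image, block=16, min_rate=0.1, max_rate=0.5):
    h, w = image.shape
    measurements = []
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            x = image[r:r + block, c:c + block].astype(np.float64).ravel()
            # More "salient" (higher-variance) blocks receive a higher sampling rate.
            saliency = x.var() / (x.var() + 1e3)   # squashes variance into (0, 1); constant is arbitrary
            rate = min_rate + (max_rate - min_rate) * saliency
            m = max(1, int(rate * x.size))
            phi = rng.standard_normal((m, x.size)) / np.sqrt(m)  # random measurement matrix
            measurements.append((r, c, phi, phi @ x))             # keep phi for reconstruction
    return measurements
```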
Figure 1: Image segmentation diagram.
Figure 2: Adaptive block sampling compressed sensing algorithm flow.
Figure 3: Saliency map acquisition; (a) original image; (b) saliency map obtained in Gaussian domain 7 × 7; (c) saliency map obtained in Gaussian domain 5 × 5; (d) saliency map obtained in Gaussian domain 3 × 3.
Figure 4: Comparison of reconstructed image results: (a) PSNR index statistics of the reconstructed image, (b) SSIM index statistics of the reconstructed image, (c) NMSE index statistics of the reconstructed image, and (d) GMSD index statistics of the reconstructed image.
Figure 5: Principle of space optical wireless video transmission.
Figure 6: Space optical wireless video transmission experiment; (a) diagram 1 of spatial optical wireless video transmission system; (b) diagram 2 of spatial optical wireless video transmission system; (c) diagram 1 of setup for spatial optical wireless video transmission experiment; (d) diagram 2 of setup for spatial optical wireless video transmission experiment.
Figure 7: Optical wireless transmission reconstruction results of video generated from a total of 500 frame image sequences: (a) PSNR index statistics (average value of the reconstructed image for each frame: APS-SPL = 38.56 dB, MS-SPL = 36.97 dB, 2DCS = 36.33 dB, SPL = 32.98 dB), (b) SSIM index statistics (average value of the reconstructed image for each frame: APS-SPL = 0.9755, MS-SPL = 0.9638, 2DCS = 0.957, SPL = 0.9168).
12 pages, 2587 KiB  
Article
Preprocessing for Multi-Dimensional Enhancement and Reconstruction in Neural Video Compression
by Jiajia Wang, Qi Zhang, Haiwu Zhao, Guozhong Wang and Xiwu Shang
Appl. Sci. 2024, 14(19), 8626; https://doi.org/10.3390/app14198626 - 25 Sep 2024
Viewed by 1030
Abstract
The surge in ultra-high-definition video content has intensified the demand for advanced video compression techniques. Video encoding preprocessing can improve coding efficiency while ensuring a high degree of compatibility with existing codecs. Existing video encoding preprocessing methods are limited in their ability to fully exploit redundant features in video data and recover high-frequency details, and their network architectures often lack compatibility with neural video encoders. To address these challenges, we propose a Multi-Dimensional Enhancement and Reconstruction (MDER) preprocessing method to improve the efficiency of deep learning-based neural video encoders. Firstly, our approach integrates a degradation compensation module to mitigate encoding noise and boost feature extraction efficiency. Secondly, a lightweight fully convolutional neural network is employed, which utilizes residual learning and knowledge distillation to refine and suppress irrelevant features across spatial and channel dimensions. Furthermore, to maximize the use of redundant information, we incorporate Dense Blocks, which can enhance and reconstruct important features in the video data during preprocessing. Finally, the preprocessed frames are mapped from pixel space to feature space through the Dense Feature-Enhanced Video Compression (DFVC) module, which improves motion estimation and compensation accuracy. The experimental results show that, compared to neural video encoders, the MDER method can reduce bits per pixel (Bpp) by 0.0714 and 0.0536 under equivalent PSNR and MS-SSIM conditions, respectively. These results indicate significant improvements in compression efficiency and reconstruction quality, highlighting the effectiveness of the MDER preprocessing method and its compatibility with neural video codec workflows.
Figure 1: The deployment workflow of the MDER preprocessing method: single-pass input frames of the original video sequence and application to neural video codec.
Figure 2: Overall architecture of the MDER preprocessing method.
Figure 3: FVC framework diagram.
Figure 4: Feature Extraction Module and Frame Reconstruction Module in DFVC. (a) Feature extraction module; (b) frame reconstruction module; and (c) Dense Block.
Figure 5: The rate–distortion curves for the VVC Class B–D dataset on PSNR and MS-SSIM.
Figure 6: Comparison of experimental results for VVC Class B–D datasets on PSNR and MS-SSIM.
24 pages, 6380 KiB  
Article
Multi-Type Self-Attention-Based Convolutional-Neural-Network Post-Filtering for AV1 Codec
by Woowoen Gwun, Kiho Choi and Gwang Hoon Park
Mathematics 2024, 12(18), 2874; https://doi.org/10.3390/math12182874 - 15 Sep 2024
Viewed by 858
Abstract
Over the past few years, there has been substantial interest and research activity surrounding the application of Convolutional Neural Networks (CNNs) for post-filtering in video coding. Most current research efforts have focused on using CNNs with various kernel sizes for post-filtering, primarily concentrating on High-Efficiency Video Coding/H.265 (HEVC) and Versatile Video Coding/H.266 (VVC). This narrow focus has limited the exploration and application of these techniques to other video coding standards such as AV1, developed by the Alliance for Open Media, which offers excellent compression efficiency, reducing bandwidth usage and improving video quality, making it highly attractive for modern streaming and media applications. This paper introduces a novel approach that extends beyond traditional CNN methods by integrating three different self-attention layers into the CNN framework. Applied to the AV1 codec, the proposed method significantly improves video quality by incorporating these distinct self-attention layers. This enhancement demonstrates the potential of self-attention mechanisms to revolutionize post-filtering techniques in video coding beyond the limitations of convolution-based methods. The experimental results show that the proposed network achieves an average BD-rate reduction of 10.40% for the Luma component and 19.22% and 16.52% for the Chroma components compared to the AV1 anchor. Visual quality assessments further validated the effectiveness of our approach, showcasing substantial artifact reduction and detail enhancement in videos. Full article
(This article belongs to the Special Issue New Advances and Applications in Image Processing and Computer Vision)
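The BD-rate figures quoted above compare rate–distortion curves of the anchor and the post-filtered output. The customary Bjøntegaard computation, fitting log-rate as a cubic in PSNR and integrating over the overlapping quality range, looks roughly like the sketch below; it is a generic textbook formulation, not the authors' evaluation script.

```python
# Rough sketch of Bjøntegaard delta rate (BD-rate). Expects a few (typically
# four) rate/PSNR pairs per codec, as in common test conditions.
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    p_a = np.polyfit(psnr_anchor, lr_a, 3)          # log-rate as cubic in PSNR (anchor)
    p_t = np.polyfit(psnr_test, lr_t, 3)            # log-rate as cubic in PSNR (test)
    lo = max(min(psnr_anchor), min(psnr_test))      # shared quality range
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)          # average log-rate difference
    return (np.exp(avg_diff) - 1) * 100.0           # negative value = bitrate saving
```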
Figure 1: (a) Illustration showing where in-loop filter is located in video codec pipeline; (b) illustration showing where post-filter is located in pipeline.
Figure 2: Proposed MTSA-based CNN.
Figure 3: (a) RCB; (b) CWSA.
Figure 4: (a) Simplified feature map with channel size of 3 and height and width sizes of 4; (b) feature map unfolded into smaller blocks; (c) feature map permuted and reshaped.
Figure 5: (a) BWSSA; (b) PWSA.
Figure 6: R-D curves by SVT-AV1 and MTSA. (a) class A1; (b) class A2; (c) class A3; (d) class A4; (e) class A5.
Figure 7: Example sequence of Class A1 PierSeaSide. (a) Original image from the AVM-CTC sequence; (b) detail inside the yellow box from (a) in the original image; (c) detail inside the yellow box from (a) in the compressed image using SVT-AV1 with QP55; (d) detail inside the yellow box from (a) after applying the post-filter using the proposed network.
Figure 8: Example sequence of Class A1 Tango. (a) Original image from the AVM-CTC sequence; (b) detail inside the yellow box from (a) in the original image; (c) detail inside the yellow box from (a) in the compressed image using SVT-AV1 with QP55; (d) detail inside the yellow box from (a) after applying the post-filter using the proposed network.
Figure 9: Example sequence of Class A2 RushFieldCuts. (a) Original image from the AVM-CTC sequence; (b) detail inside the yellow box from (a) in the original image; (c) detail inside the yellow box from (a) in the compressed image using SVT-AV1 with QP43; (d) detail inside the yellow box from (a) after applying the post-filter using the proposed network.
Figure 10: Methods to handle empty spaces for edge patches; (a) empty spaces filled with zero value; (b) empty spaces filled with edge pixel value extended.
Figure 11: Network wrongly turning edge pixel into darker value; (a) pixel value difference between the original video frame and the AV1-encoded frame; (b) pixel value difference between the original video frame and the AV1-encoded frame processed by the proposed network, with larger positive pixel differences in Y indicating that the processed frame is darker, at the bottom of the image.
26 pages, 7340 KiB  
Article
Versatile Video Coding-Post Processing Feature Fusion: A Post-Processing Convolutional Neural Network with Progressive Feature Fusion for Efficient Video Enhancement
by Tanni Das, Xilong Liang and Kiho Choi
Appl. Sci. 2024, 14(18), 8276; https://doi.org/10.3390/app14188276 - 13 Sep 2024
Viewed by 1135
Abstract
Advanced video codecs such as High Efficiency Video Coding/H.265 (HEVC) and Versatile Video Coding/H.266 (VVC) are vital for streaming high-quality online video content, as they compress and transmit data efficiently. However, these codecs can occasionally degrade video quality by adding undesirable artifacts such as blockiness, blurriness, and ringing, which can detract from the viewer’s experience. To ensure a seamless and engaging video experience, it is essential to remove these artifacts, which improves viewer comfort and engagement. In this paper, we propose a deep feature fusion based convolutional neural network (CNN) architecture (VVC-PPFF) for post-processing approach to further enhance the performance of VVC. The proposed network, VVC-PPFF, harnesses the power of CNNs to enhance decoded frames, significantly improving the coding efficiency of the state-of-the-art VVC video coding standard. By combining deep features from early and later convolution layers, the network learns to extract both low-level and high-level features, resulting in more generalized outputs that adapt to different quantization parameter (QP) values. The proposed VVC-PPFF network achieves outstanding performance, with Bjøntegaard Delta Rate (BD-Rate) improvements of 5.81% and 6.98% for luma components in random access (RA) and low-delay (LD) configurations, respectively, while also boosting peak signal-to-noise ratio (PSNR). Full article
Figure 1: Enhancing video quality with CNN based post-processing in conventional VVC coding workflow.
Figure 2: MP4 to YUV conversion and reconstruction using VVenC and VVdeC.
Figure 3: Illustration of video-to-image conversion process: (a) original videos converted to original images using FFmpeg, and (b) reconstructed videos converted to reconstructed images using FFmpeg.
Figure 4: Illustration of the conversion process from YUV 4:2:0 format to YUV 4:4:4 format before feeding data into the deep learning network.
Figure 5: Illustration of down-sampling process of neural network output from YUV 4:4:4 to YUV 4:2:0 format.
Figure 6: Architecture of the proposed CNN-based post-filtering method, integrating multiple feature extractions for enhanced output refinement.
Figure 7: Comparative visualization of (b) reconstructed frames from anchor VVC and (c) proposed methods for DaylightRoad2 sequence at QP 42 for RA configuration, alongside the (a) original uncompressed reference frame.
Figure 8: Comparative visualization of (b) reconstructed frames from anchor VVC and (c) proposed methods for FourPeople sequence at QP 42 for LD configuration, alongside the (a) original uncompressed reference frame.
Figure 9: RD curve performance comparison for five different test sequences in RA configuration.
Figure 10: RD curve performance comparison for four different test sequences in LD configuration.
Figure 11: Visual quality comparison of proposed method with 8 feature extraction blocks for RA and LD scenarios at QP 42: (a) MarketPlace Sequence and (b) PartyScene Sequence.
Figure 12: Visual quality comparison of proposed method with 12 feature extraction blocks for RA and LD scenarios at QP 42: (a) RitualDance Sequence and (b) Cactus Sequence.
23 pages, 5896 KiB  
Article
A Lightweight Method for Ripeness Detection and Counting of Chinese Flowering Cabbage in the Natural Environment
by Mengcheng Wu, Kai Yuan, Yuanqing Shui, Qian Wang and Zuoxi Zhao
Agronomy 2024, 14(8), 1835; https://doi.org/10.3390/agronomy14081835 - 20 Aug 2024
Viewed by 917
Abstract
The rapid and accurate detection of Chinese flowering cabbage ripeness and the counting of Chinese flowering cabbage are fundamental for timely harvesting, yield prediction, and field management. The complexity of the existing model structures somewhat hinders the application of recognition models in harvesting machines. Therefore, this paper proposes the lightweight Cabbage-YOLO model. First, the YOLOv8-n feature pyramid structure is adjusted to effectively utilize the target’s spatial structure information as well as compress the model in size. Second, the RVB-EMA module is introduced into the neck to mitigate the interference of shallow noise in the high-resolution layers while also reducing the number of parameters in the model. In addition, the head uses an independently designed lightweight PCDetect detection head, which enhances the computational efficiency of the model. Subsequently, the neck utilizes a lightweight DySample upsampling operator to capture and preserve underlying semantic information. Finally, the SimAM attention mechanism is inserted before SPPF for an enhanced ability to capture foreground features. The improved Cabbage-YOLO is integrated with the Byte Tracker to track and count Chinese flowering cabbage in video sequences. The average detection accuracy of Cabbage-YOLO can reach 86.4%. Compared with the original YOLOv8-n model, its FLOPs, number of parameters, and weight size are decreased by about 35.9%, 47.2%, and 45.2%, respectively, and its average detection precision is improved by 1.9% at 107.8 FPS. In addition, Cabbage-YOLO integrated with the Byte Tracker can also effectively track and count the detected objects. The Cabbage-YOLO model boasts higher accuracy, smaller size, and a clear advantage in lightweight deployment. Overall, the improved lightweight model can provide effective technical support for promoting intelligent management and harvesting decisions of Chinese flowering cabbage.
(This article belongs to the Special Issue Advanced Machine Learning in Agriculture)
Figure 1: Realistic acquisition scenarios. (A) Data collection platform. (B) Data acquisition camera.
Figure 2: Dataset annotations and labeling categories. (A) Annotated image. (B) Example of a tag category.
Figure 3: Dataset expansion and label distribution visualization. (A) Original image. (B) Brightness adjustment. (C) Randomized cropping and scaling. (D) Add motion blur. (E) Visualization of the expanded label distribution.
Figure 4: Structure of the YOLOV8-n model.
Figure 5: Structure of the Cabbage-YOLO model.
Figure 6: Structure of the C2f-RepViT-EMA Block. (A) C2f module. (B) C2f-RepViT Block (C2f-RVB) module. (C) C2f-RepViT-EMA Block (C2f-RVB-EMA) module.
Figure 7: Structure of the RepViT-EMA Block. (A) RepViT Block (RVB) module. (B) RepViT-EMA Block (RVB-EMA) module. (C) EMA attention mechanism.
Figure 8: Lightweight inspection head construction. (A) PCConv module (PCC). (B) PCDetect detection head.
Figure 9: Dynamic up-sampling structure of DySample. (A) A sample set generator with a dynamic range factor. (B) Sampling based dynamic upsampling.
Figure 10: SimAM attention mechanism.
Figure 11: Tracker operation process and Byte Tracker structure. (A) Byte Tracker internal operation logic.
Figure 12: Training results of the Cabbage-YOLO model.
Figure 13: Five examples of lightweight algorithms for ripeness detection of Chinese flowering cabbage. (A) YOLOv3-tiny. (B) YOLOv5-n. (C) YOLOv6-n. (D) YOLOv7-n. (E) YOLOv8-n. (F) Cabbage-YOLO. The yellow diamond box in the figure indicates a missed detection, and the green circle indicates a misdiagnosis.
Figure 14: FLOPs for each module of Cabbage-YOLO.
Figure 15: (A–I) Example of tracking counts (the yellow diamond box in the figure indicates a missed detection).
Figure 16: Scatterplot of tracked predicted versus actual values. (The green regression line in the plot has a slope of 1 and an intercept of 0. The red dotted frame highlights the scenario where the target quantity is 50).
36 pages, 3308 KiB  
Review
Fractional Calculus Meets Neural Networks for Computer Vision: A Survey
by Cecília Coelho, M. Fernanda P. Costa and Luís L. Ferrás
AI 2024, 5(3), 1391-1426; https://doi.org/10.3390/ai5030067 - 7 Aug 2024
Cited by 1 | Viewed by 1781
Abstract
Traditional computer vision techniques aim to extract meaningful information from images but often depend on manual feature engineering, making it difficult to handle complex real-world scenarios. Fractional calculus (FC), which extends derivatives to non-integer orders, provides a flexible way to model systems with memory effects and long-term dependencies, making it a powerful tool for capturing fractional rates of variation. Recently, neural networks (NNs) have demonstrated remarkable capabilities in learning complex patterns directly from raw data, automating computer vision tasks and enhancing performance. Therefore, the use of fractional calculus in neural network-based computer vision is a powerful method to address existing challenges by effectively capturing complex spatial and temporal relationships in images and videos. This paper presents a survey of fractional calculus neural network-based (FC NN-based) computer vision techniques for denoising, enhancement, object detection, segmentation, restoration, and NN compression. This survey compiles existing FC NN-based approaches, elucidates underlying concepts, and identifies open questions and research directions. By leveraging FC’s properties, FC NN-based approaches offer a novel way to improve the robustness and efficiency of computer vision systems.
(This article belongs to the Special Issue Artificial Intelligence-Based Image Processing and Computer Vision)
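As a concrete anchor for the fractional-derivative idea, the Grünwald–Letnikov definition can be discretized in a few lines. The sketch below is a generic one-dimensional illustration under that definition and is not tied to any specific network in the survey.

```python
# Illustrative Grünwald-Letnikov approximation of a fractional derivative of
# order alpha on a uniformly sampled signal with step h.
import numpy as np

def gl_fractional_derivative(signal, alpha, h=1.0):
    signal = np.asarray(signal, dtype=float)
    n = len(signal)
    # GL coefficients: w_0 = 1, w_k = w_{k-1} * (1 - (alpha + 1) / k)
    w = np.ones(n)
    for k in range(1, n):
        w[k] = w[k - 1] * (1.0 - (alpha + 1.0) / k)
    out = np.zeros(n)
    for i in range(n):
        # Weighted sum over the signal history, reversed so w_0 multiplies f(t_i).
        out[i] = np.dot(w[: i + 1], signal[i::-1]) / (h ** alpha)
    return out
```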
Figure 1: Example use cases of different tasks in computer vision: denoising for removing unwanted noise, enhancement for ground-truth image’s quality improvement, object detection for identification and labelling, segmentation for image partitioning for further analysis, and restoration for missing parts inpainting (ground-truth image generated by DALL-E 3).
Figure 2: Architecture of FOCNet.
Figure 3: Schematic representation of a multi-scale FOCNet with two levels.
Figure 4: Training process of Neural Fractional-Order Adaptive Masks.
Figure 5: Image enhancement with fractional Rényi entropy before using a CNN for image segmentation.
Figure 6: Architecture of FrOLM-DNN for object detection and classification of 3D image (input image generated by DALL-E 3).
Figure 7: Architecture of a GSN with a ScatNet and PCA encoder, and a CNN decoder.
Figure 8: Architecture of a GFRSN with an FrScatNet and FM encoder, and a CNN decoder. Using two GFRSNs with different fractional orders, one can enhance the predicted image ỹ by merging the outputs from both orders, ỹ_α1 and ỹ_α2.
27 pages, 5463 KiB  
Article
Best Practices for Measuring the Modulation Transfer Function of Video Endoscopes
by Quanzeng Wang, Chinh Tran, Peter Burns and Nader M. Namazi
Sensors 2024, 24(15), 5075; https://doi.org/10.3390/s24155075 - 5 Aug 2024
Viewed by 1526
Abstract
Endoscopes are crucial for assisting in surgery and disease diagnosis, including the early detection of cancer. The effective use of endoscopes relies on their optical performance, which can be characterized with a series of metrics such as resolution, vital for revealing anatomical details. The modulation transfer function (MTF) is a key metric for evaluating endoscope resolution. However, the 2020 version of the ISO 8600-5 standard, while introducing an endoscope MTF measurement method, lacks empirical validation and excludes opto-electronic video endoscopes, the largest family of endoscopes. Measuring the MTF of video endoscopes requires tailored standards that address their unique characteristics. This paper aims to expand the scope of ISO 8600-5:2020 to include video endoscopes, by optimizing the MTF test method and addressing parameters affecting measurement accuracy. We studied the effects of intensity and uniformity of image luminance, chart modulation compensation, linearity of image digital values, auto gain control, image enhancement, image compression and the region of interest dimensions on images of slanted-edge test charts, and thus the MTF based on these images. By analyzing these effects, we provided recommendations for setting and controlling these factors to obtain accurate MTF curves. Our goal is to enhance the standard’s relevance and effectiveness for measuring the MTF of a broader range of endoscopic devices, with potential applications in the MTF measurement of other digital imaging devices. Full article
(This article belongs to the Special Issue Medical Imaging and Sensing Technologies)
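The slanted-edge MTF computation at the heart of the measurement can be outlined as follows. This stripped-down sketch averages the edge profile directly instead of performing the ISO 12233 sub-pixel oversampling and OECF linearization steps discussed in the paper, so it only illustrates the basic ESF to LSF to FFT chain.

```python
# Very simplified slanted-edge MTF sketch for a near-vertical edge ROI:
# estimate the edge spread function (ESF), differentiate it into a line
# spread function (LSF), and take the magnitude of its FFT.
import numpy as np

def mtf_from_edge(roi: np.ndarray):
    esf = roi.mean(axis=0)                      # average across rows -> ESF (one value per column)
    lsf = np.gradient(esf)                      # derivative of ESF -> LSF
    lsf = lsf * np.hanning(len(lsf))            # window to reduce truncation ripple
    mtf = np.abs(np.fft.rfft(lsf))
    mtf = mtf / mtf[0]                          # normalise so MTF(0) = 1
    freqs = np.fft.rfftfreq(len(lsf), d=1.0)    # spatial frequency in cycles per pixel
    return freqs, mtf
```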
Figure 1: Experimental setup (using the inherent xenon light source).
Figure 2: Edge center locations and directions on chart images for MTF calculation. “On-axis” point A is located at the image center. “Off-axis” points B1, B2, B3, and B4 are located at 70% of the distances from the center to the horizontal or vertical boundaries. Directions of edges: horizontal (H) at B2 and B4; vertical (V) at B1 and B3; both H and V at A (the results should be the same for square pixels).
Figure 3: MTF measurement flowchart.
Figure 4: Image showing 20 gray patches on the extended ISO 12233:2017 Edge SFR chart, captured by the endoscope for OECF measurement. The average image pixel value within each red square represents the image luminance on that patch.
Figure 5: MTF curves at locations A and B2 under relatively uniform external incandescent light and non-uniform internal xenon light at different levels of intensities.
Figure 6: The uniformity of image luminance. (a,b): images of the Spectralon target; (c,d): normalized image luminance through the B2–B4 line, with location A as one; (a,c): incandescent light; (b,d): xenon light. The values at B2 in (c,d) are 1.07 and 0.44, respectively.
Figure 7: The impacts of M_chart compensation. Colored dashed curves: the measured MTF without compensation (MTF_measured), with the orange curve in (a) for the center position A and the others in (b–d) artificially generated. Colored solid curves: the M_chart-compensated MTF of the dashed curves with the same color. Different colors represent different endoscope MTFs. Black dashed curves: M_chart used to calculate the compensated MTF in the same graph. Dashed vertical lines: f_Nyq. Solid vertical lines in (a,c,d): 0.3, 0.5, and 0.7 times f_Nyq, respectively.
Figure 8: MTF curves for different encoding gamma values measured with low- and high-contrast charts. The vertical line is half Nyquist frequency.
Figure 9: Effects of AGC on MTF curves.
Figure 10: Effects of image enhancement on MTF curves.
Figure 11: The MTF curves for an endoscope TIFF image and its compressed JPEG images (the JPEG quality scalar represents the compression level, with 100 indicating the lowest compression and highest quality). (a): The MTF curves calculated from the images captured by the endoscope; (b): The MTF curves calculated from the TIFF image captured by the endoscope and the images compressed with MATLAB from the TIFF image.
Figure 12: Blocky compression artifacts of edge images (80 × 60 pixels).
Figure 13: The MTF curves (a) for a TIFF image (b) from an online source [24] and its compressed JPEG images.
Figure 14: MTF curves based on different ROI dimensions (a,b) and edge profile (c).
19 pages, 746 KiB  
Article
Fast Depth Map Coding Algorithm for 3D-HEVC Based on Gradient Boosting Machine
by Xiaoke Su, Yaqiong Liu and Qiuwen Zhang
Electronics 2024, 13(13), 2586; https://doi.org/10.3390/electronics13132586 - 1 Jul 2024
Viewed by 1092
Abstract
Three-Dimensional High-Efficiency Video Coding (3D-HEVC) has been extensively researched due to its efficient compression and deep image representation, but encoding complexity continues to pose a difficulty. This is mainly attributed to redundancy in the coding unit (CU) recursive partitioning process and rate–distortion (RD) cost calculation, resulting in a complex encoding process. Therefore, enhancing encoding efficiency and reducing redundant computations are key objectives for optimizing 3D-HEVC. This paper introduces a fast-encoding method for 3D-HEVC, comprising an adaptive CU partitioning algorithm and a rapid rate–distortion-optimization (RDO) algorithm. Based on the ALV features extracted from each coding unit, a Gradient Boosting Machine (GBM) model is constructed to obtain the corresponding CU thresholds. These thresholds are compared with the ALV to further decide whether to continue dividing the coding unit. The RDO algorithm is used to optimize the RD cost calculation process, selecting the optimal prediction mode as much as possible. The simulation results show that this method saves 52.49% of complexity while ensuring good video quality. Full article
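A minimal sketch of the threshold-learning step is shown below, using scikit-learn's gradient boosting as a stand-in for the paper's GBM; the training pairs are made up purely for illustration, and the real model is fitted on statistics collected from encoded sequences.

```python
# Sketch: train a gradient-boosting model to map a CU's ALV feature and size
# to a split/no-split threshold, then compare the ALV against that threshold.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical training set: [ALV, CU size] -> threshold collected offline.
X_train = np.array([[1.2, 64], [3.5, 64], [0.8, 32], [2.9, 32], [0.4, 16], [1.7, 16]])
y_train = np.array([1.5, 4.0, 1.0, 3.2, 0.6, 2.0])

gbm = GradientBoostingRegressor(n_estimators=100, max_depth=3)
gbm.fit(X_train, y_train)

def skip_split(alv: float, cu_size: int) -> bool:
    threshold = gbm.predict([[alv, cu_size]])[0]
    return alv < threshold  # below the learned threshold: stop partitioning this CU
```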
Figure 1: Example of quadtree division from CTU to CU.
Figure 2: Predictive mode diagram of 3D-HEVC.
Figure 3: Analysis of encoding unit complexity.
Figure 4: Process of local variance calculation.
Figure 5: Cumulative distribution of CU sizes in depth maps for ALV.
Figure 6: Depicting the method flow based on the GBM model.
Figure 7: Threshold curves for CU sizes in 3D-HEVC and their modeling functions.
Figure 8: Proportion of time consumption for bitrate prediction, depth distortion calculation, and SVD calculation in different sequences.
Figure 9: Flowchart of fast RDO algorithm.
Figure 10: Experimental results of RD trajectory.
17 pages, 4537 KiB  
Article
Video Multi-Scale-Based End-to-End Rate Control in Deep Contextual Video Compression
by Lili Wei, Zhenglong Yang, Hua Zhang, Xinyu Liu, Weihao Deng and Youchao Zhang
Appl. Sci. 2024, 14(13), 5573; https://doi.org/10.3390/app14135573 - 26 Jun 2024
Cited by 1 | Viewed by 1074
Abstract
In recent years, video data have increased in size, which results in enormous transmission pressure. Rate control plays an important role in stabilizing video stream transmissions by balancing the rate and distortion of video compression. To achieve high-quality videos through low-bandwidth transmission, video multi-scale-based end-to-end rate control is proposed. First, to reduce video data, the original video is processed using multi-scale bicubic downsampling as the input. Then, the end-to-end rate control model is implemented. By fully using the temporal coding correlation, a two-branch residual-based network and a two-branch regression-based network are designed to obtain the optimal bit rate ratio and Lagrange multiplier λ for rate control. For restoring high-resolution videos, a hybrid efficient distillation SISR network (HEDS-Net) is designed to build low-resolution and high-resolution feature dependencies, in which a multi-branch distillation network, a lightweight attention LCA block, and an upsampling network are used to transmit deep extracted frame features, enhance feature expression, and improve image detail restoration abilities, respectively. The experimental results show that the PSNR and SSIM BD rates of the proposed multi-scale-based end-to-end rate control are −1.24% and −0.50%, respectively, with 1.82% rate control accuracy. Full article
Figure 1: Coding frameworks: (a) traditional hybrid coding framework; (b) end-to-end coding framework.
Figure 2: Frame super-resolution-based end-to-end rate control.
Figure 3: Two-branch residual-based network.
Figure 4: Two-branch regression-based network.
Figure 5: Multi-branch distillation network.
Figure 6: Lightweight attention LCA block.
Figure 7: Upsampling network.
Figure 8: RD curve comparisons of (A): the proposed algorithm, (B): DCVC, (C): Li et al. [22] and (D): Li et al. [9].
Figure 9: Visualization map of PSNR indexes for SRCNN, VDSR, EDSR, RCAN, and HEDS-Net.
Figure 10: Subjective comparisons of BasketballDrive for the second frame @ 2320.8 kbps and Cactus for the second frame @ 4261.38 kbps. (a-1,a-2) are the ground truth images; (b-1,b-2) are images from the proposed algorithm; (c-1,c-2) are the images from DCVC; (d-1,d-2) are the images from Li et al. [22]; (e-1,e-2) are the images from Li et al. [9].
20 pages, 16671 KiB  
Article
A Light-Field Video Dataset of Scenes with Moving Objects Captured with a Plenoptic Video Camera
by Kamran Javidi and Maria G. Martini
Electronics 2024, 13(11), 2223; https://doi.org/10.3390/electronics13112223 - 6 Jun 2024
Viewed by 1067
Abstract
Light-field video provides a detailed representation of scenes captured from different perspectives. This results in a visualisation modality that enhances the immersion and engagement of the viewers with the depicted environment. In order to perform research on compression, transmission and signal processing of light field data, datasets with light-field contents of different categories and acquired with different modalities are required. In particular, the development of machine learning models for quality assessment and for light-field processing, including the generation of new views, requires large amounts of data. Most existing datasets consist of static scenes and, in many cases, synthetic contents. This paper presents a novel light-field plenoptic video dataset, KULFR8, involving six real-world scenes with moving objects and 336 distorted light-field videos derived from the original contents; in total, the original scenes in the dataset contain 1800 distinct frames, with an angular resolution of 5×5 and a total spatial resolution of 9600×5400 pixels (considering all the views); overall, the dataset consists of 45,000 different views with a spatial resolution of 1920×1080 pixels. We analyse the content characteristics based on the dimensions of the captured objects and via the acquired videos using the central views extracted from each quilted frame. Additionally, we encode and decode the contents using various video encoders across different bitrate ranges. For quality assessments, we consider all the views, utilising frames measuring 9600×5400 pixels, and employ two objective quality metrics: PSNR and SSIM.
(This article belongs to the Special Issue Advances in Human-Centered Digital Systems and Services)
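The objective evaluation described above boils down to computing PSNR and SSIM between reference and encoded-decoded quilted frames. A small sketch using scikit-image is given below; the choice of library is a tooling assumption, not a statement about the authors' scripts.

```python
# Sketch of per-frame quality evaluation: PSNR and SSIM between an original
# quilted frame (all views stitched together) and its compressed counterpart.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_quality(ref: np.ndarray, dist: np.ndarray):
    psnr = peak_signal_noise_ratio(ref, dist, data_range=255)
    ssim = structural_similarity(ref, dist, channel_axis=-1, data_range=255)
    return psnr, ssim

# Usage: average frame_quality(...) over all frames of a sequence to obtain
# the per-content scores plotted in the quality-metric figures.
```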
Figure 1: Raytrix R8 view samples. (a–f) subfigures illustrate the view samples of Bee, Crab, Dinosaur, Magician, Mouse, and Water contents, respectively.
Figure 2: Raytrix R8 central view depth samples with the scale bar in (cm). (a–f) subfigures illustrate the depth maps of Bee, Crab, Dinosaur, Magician, Mouse, and Water contents, respectively.
Figure 3: The geometry details of each of the used models are as follows: (a) Bee, (b) Crab, (c) Dinosaur, (d) Magician, (e) Mouse, and two models are used for the Water content, which are (f) Fish and (g) Turtle.
Figure 4: Content characterisation evaluation flows for (a) motion displacement characterisation flow, and (b) SI_cv, TI_cv, and CF_cv characterisation flow.
Figure 5: Each row of images represents one content in the different color spaces HSL, HSV, LAB, and YUV (in order from left to right for the first four sample images) followed by the (L) component for HSL, (V) component for HSV, (L) component for LAB, and (Y) component for YUV (sorted in each row from five to eight). The view samples reported in the six rows are from the contents ‘Bee’, ‘Crab’, ‘Dinosaur’, ‘Magician’, ‘Mouse’, and ‘Water’.
Figure 6: Detected motion vectors; (a) ‘L’ component in HSL, (b) ‘V’ component in HSV, (c) ‘L’ component in LAB, and (d) ‘Y’ component in YUV.
Figure 7: Overall motion vectors added on a sample view of ‘V’ component in HSV color space. (a–f) subfigures illustrate the motion vectors on Bee, Crab, Dinosaur, Magician, Mouse, and Water contents, respectively.
Figure 8: Content characterisation values: (a,b) present motion displacement values for HSV colour space in the vertical axis versus SI and TI, respectively, in the horizontal axis; (c,d) present colourfulness values in the vertical axis and SI and TI, respectively, in the horizontal axis; (e–g) represent SI versus TI values for three colour spaces of HSL, HSV, and YUV, respectively.
Figure 9: Quality assessment encompasses two main components: (a) the encode–decode procedure applied to the light-field video contents, and (b) the evaluation of quality metrics such as PSNR and SSIM.
Figure 10: Quality metric plots PSNR_YUV and SSIM_YUV for ‘Bee’, ‘Crab’, ‘Dinosaur’, ‘Magician’, ‘Mouse’, and ‘Water’ light-field video contents.
15 pages, 2640 KiB  
Article
Toward Intraoperative Visual Intelligence: Real-Time Surgical Instrument Segmentation for Enhanced Surgical Monitoring
by Mostafa Daneshgar Rahbar, George Pappas and Nabih Jaber
Healthcare 2024, 12(11), 1112; https://doi.org/10.3390/healthcare12111112 - 29 May 2024
Viewed by 1465
Abstract
Background: Open surgery relies heavily on the surgeon’s visual acuity and spatial awareness to track instruments within a dynamic and often cluttered surgical field. Methods: This system utilizes a head-mounted depth camera to monitor surgical scenes, providing both image data and depth information. The video captured from this camera is scaled down, compressed using MPEG, and transmitted to a high-performance workstation via the RTSP (Real-Time Streaming Protocol), a reliable protocol designed for real-time media transmission. To segment surgical instruments, we utilize the enhanced U-Net with GridMask (EUGNet) for its proven effectiveness in surgical tool segmentation. Results: For rigorous validation, the system’s performance reliability and accuracy are evaluated using prerecorded RGB-D surgical videos. This work demonstrates the potential of this system to improve situational awareness, surgical efficiency, and generate data-driven insights within the operating room. In a simulated surgical environment, the system achieves a high accuracy of 85.5% in identifying and segmenting surgical instruments. Furthermore, the wireless video transmission proves reliable with a latency of 200 ms, suitable for real-time processing. Conclusions: These findings represent a promising step towards the development of assistive technologies with the potential to significantly enhance surgical practice. Full article
(This article belongs to the Section Artificial Intelligence in Medicine)
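On the workstation side, receiving the MPEG-compressed stream over RTSP and handing frames to the segmentation network can be sketched with OpenCV; the stream URL is a placeholder and the segmentation call is hypothetical, since the EUGNet implementation is the authors' own.

```python
# Sketch of the receiving side: pull the compressed stream over RTSP with
# OpenCV and pass each decoded frame to a segmentation model.
import cv2

cap = cv2.VideoCapture("rtsp://camera-host:8554/surgical")  # placeholder URL
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break  # stream dropped or ended
    # mask = eugnet_model.segment(frame)   # hypothetical segmentation call
    cv2.imshow("instrument segmentation input", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```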
Figure 1: The intraoperative visual intelligence system is designed to provide real-time surgical instrument segmentation and tracking during open surgical procedures. This system leverages a multi-component hardware setup to capture, process, and analyze surgical scenes. The key hardware components of the system include the Intel® RealSense™ Depth Camera D455, UP 2 (UP Squared) Maker Board with Intel HD Graphics, Lenovo Legion Tower T7 341AZ7, and wireless transmission. This integrated hardware system enables the efficient capture, transmission, and processing of surgical video data, facilitating real-time instrument segmentation and tracking and ultimately enhancing surgical visualization and safety.
Figure 2: Enhanced U-Net with GridMask (EUGNet) architecture for robust real-time surgical instrument segmentation: leveraging deep contextual encoding, adaptive feature fusion, and GridMask data augmentation and data collection for algorithm evaluation.
Figure 3: rqt_graph-generated ROS communication.
Figure 4: Qualitative comparison of our proposed convolutional architectures with GridMask data augmentation.
Figure 5: Comparative performance of enhanced U-Net with and without GridMask data augmentation. This composite image presents a comprehensive evaluation of the impact of GridMask data augmentation on the performance of an enhanced U-Net model. The evaluation is conducted over 40 epochs and focuses on three key performance indicators: (a) Training Loss: the graph illustrates the evolution of training loss over the course of training. (b) Dice Coefficient: this metric assesses the overlap between the predicted segmentation and the ground truth. (c) Accuracy: the overall accuracy of the model is presented.
24 pages, 96595 KiB  
Article
Modified ESRGAN with Uformer for Video Satellite Imagery Super-Resolution
by Kinga Karwowska and Damian Wierzbicki
Remote Sens. 2024, 16(11), 1926; https://doi.org/10.3390/rs16111926 - 27 May 2024
Viewed by 1167
Abstract
In recent years, a growing number of sensors that provide imagery with constantly increasing spatial resolution are being placed on the orbit. Contemporary Very-High-Resolution Satellites (VHRS) are capable of recording images with a spatial resolution of less than 0.30 m. However, until now, these scenes were acquired in a static way. The new technique of the dynamic acquisition of video satellite imagery has been available only for a few years. It has multiple applications related to remote sensing. However, in spite of the offered possibility to detect dynamic targets, its main limitation is the degradation of the spatial resolution of the image that results from imaging in video mode, along with a significant influence of lossy compression. This article presents a methodology that employs Generative Adversarial Networks (GAN). For this purpose, a modified ESRGAN architecture is used for the spatial resolution enhancement of video satellite images. In this solution, the GAN network generator was extended by the Uformer model, which is responsible for a significant improvement in the quality of the estimated SR images. This enhances the possibilities to recognize and detect objects significantly. The discussed solution was tested on the Jilin-1 dataset and it presents the best results for both the global and local assessment of the image (the mean values of the SSIM and PSNR parameters for the test data were, respectively, 0.98 and 38.32 dB). Additionally, the proposed solution, in spite of the fact that it employs artificial neural networks, does not require a high computational capacity, which means it can be implemented in workstations that are not equipped with graphic processors. Full article
(This article belongs to the Section Remote Sensing Image Processing)
Figure 1: Diagram of enhancement of spatial resolution of a single video frame.
Figure 2: Discriminator model [58].
Figure 3: The flowchart of the algorithm.
Figure 4: Examples of images from test data with quality results shown in Table 3: (a) HR image, (b) MCWESRGAN with Uformer, (c) MCWESRGAN with Lucy–Richardson algorithm, and (d) MCWESRGAN with Wiener deconvolution.
Figure 5: Structural similarity between the estimated images (tiles) (SR) and the reference HR images.
Figure 6: Peak signal-to-noise ratio (PSNR [dB]) between the estimated images (tiles) (SR) and the reference HR images.
Figure 7: Local assessment: SSIM metrics (for the evaluated field of the size of 20 × 20 pixels).
Figure 8: Local assessment: PSNR metrics (for the evaluated field of the size of 20 × 20 pixels).
Figure 9: PSD diagram on the x and y directions for a sample image.
Figure 10: Images in the frequency domain.