Search Results (71)

Search Parameters:
Keywords = hierarchical encoding-decoding

21 pages, 6234 KiB  
Article
Data-Efficient Bone Segmentation Using Feature Pyramid-Based SegFormer
by Naohiro Masuda, Keiko Ono, Daisuke Tawara, Yusuke Matsuura and Kentaro Sakabe
Sensors 2025, 25(1), 81; https://doi.org/10.3390/s25010081 - 26 Dec 2024
Viewed by 240
Abstract
The semantic segmentation of bone structures demands pixel-level classification accuracy to create reliable bone models for diagnosis. While Convolutional Neural Networks (CNNs) are commonly used for segmentation, they often struggle with complex shapes due to their focus on texture features and limited ability to incorporate positional information. As orthopedic surgery increasingly requires precise automatic diagnosis, we explored SegFormer, an enhanced Vision Transformer model that better handles spatial awareness in segmentation tasks. However, SegFormer’s effectiveness is typically limited by its need for extensive training data, which is particularly challenging in medical imaging, where obtaining labeled ground truths (GTs) is a costly and resource-intensive process. In this paper, we propose two models and their combination to enable accurate feature extraction from smaller datasets by improving SegFormer. Specifically, these include the data-efficient model, which deepens the hierarchical encoder by adding convolution layers to transformer blocks and increases feature map resolution within transformer blocks, and the FPN-based model, which enhances the decoder through a Feature Pyramid Network (FPN) and attention mechanisms. Testing our model on spine images from the Cancer Imaging Archive and our own hand and wrist dataset, ablation studies confirmed that our modifications outperform the original SegFormer, U-Net, and Mask2Former. These enhancements enable better image feature extraction and more precise object contour detection, which is particularly beneficial for medical imaging applications with limited training data. Full article
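The following is a minimal sketch of an FPN-style decoder over hierarchical encoder features, in the spirit of the FPN-based model described in this abstract. Channel widths, the four-stage layout, and the fusion head are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNDecoder(nn.Module):
    def __init__(self, in_channels=(32, 64, 160, 256), out_channels=128, num_classes=2):
        super().__init__()
        # Lateral 1x1 convs project each encoder stage to a common width.
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)
        self.classifier = nn.Conv2d(out_channels * len(in_channels), num_classes, 1)

    def forward(self, encoder_feats):
        # encoder_feats: list of (B, C_i, H_i, W_i) maps, finest first, coarsest last.
        laterals = [lat(f) for lat, f in zip(self.lateral, encoder_feats)]
        # Top-down pathway: add upsampled coarse features into finer ones.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="bilinear", align_corners=False)
        feats = [s(l) for s, l in zip(self.smooth, laterals)]
        # Upsample everything to the finest resolution and fuse for pixel-wise logits.
        target = feats[0].shape[-2:]
        fused = torch.cat([F.interpolate(f, size=target, mode="bilinear", align_corners=False)
                           for f in feats], dim=1)
        return self.classifier(fused)

# Example: four SegFormer-like stages at strides 4/8/16/32 for a 256x256 input.
feats = [torch.randn(1, c, 256 // s, 256 // s)
         for c, s in zip((32, 64, 160, 256), (4, 8, 16, 32))]
logits = FPNDecoder()(feats)   # (1, num_classes, 64, 64)
```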
(This article belongs to the Section Biomedical Sensors)
Figures:
Figure 1: SegFormer architecture.
Figure 2: Proposed model architecture.
Figure 3: Data-efficient encoder architecture.
Figure 4: Model-wise IoU for spine images.
Figure 5: Model-wise IoU for hand and wrist images.
Figure 6: Model-wise IoU for femur images.
Figure A1: Datasets.
Figure A2: Spine segmentation.
Figure A3: Hand and wrist segmentation.
Figure A4: Femur segmentation.
15 pages, 3365 KiB  
Article
Robust Automated Mouse Micro-CT Segmentation Using Swin UNEt TRansformers
by Lu Jiang, Di Xu, Qifan Xu, Arion Chatziioannou, Keisuke S. Iwamoto, Susanta Hui and Ke Sheng
Bioengineering 2024, 11(12), 1255; https://doi.org/10.3390/bioengineering11121255 - 11 Dec 2024
Viewed by 663
Abstract
Image-guided mouse irradiation is essential to understand interventions involving radiation prior to human studies. Our objective is to employ Swin UNEt TRansformers (Swin UNETR) to segment native micro-CT and contrast-enhanced micro-CT scans and benchmark the results against 3D no-new-Net (nnU-Net). Swin UNETR reformulates mouse organ segmentation as a sequence-to-sequence prediction task using a hierarchical Swin Transformer encoder to extract features at five resolution levels, and it connects to a Fully Convolutional Neural Network (FCNN)-based decoder via skip connections. The models were trained and evaluated on open datasets, with data separation based on individual mice. Further evaluation on an external mouse dataset acquired on a different micro-CT with lower kVp and higher imaging noise was also employed to assess model robustness and generalizability. The results indicate that Swin UNETR consistently outperforms nnU-Net and AIMOS in terms of the average dice similarity coefficient (DSC) and the Hausdorff distance (HD95p), except in two mice for intestine contouring. This superior performance is especially evident in the external dataset, confirming the model’s robustness to variations in imaging conditions, including noise and quality, and thereby positioning Swin UNETR as a highly generalizable and efficient tool for automated contouring in pre-clinical workflows. Full article
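A minimal sketch of the Dice similarity coefficient (DSC) used to compare the segmentation models above, computed per label on integer label volumes; the label IDs and the toy data are placeholders.

```python
import numpy as np

def dice_per_label(pred, gt, labels):
    """pred, gt: integer label volumes of identical shape."""
    scores = {}
    for lb in labels:
        p, g = (pred == lb), (gt == lb)
        denom = p.sum() + g.sum()
        scores[lb] = 2.0 * np.logical_and(p, g).sum() / denom if denom else np.nan
    return scores

# Toy example on a random 3D volume with three "organ" labels.
rng = np.random.default_rng(0)
gt = rng.integers(0, 4, size=(64, 64, 64))
pred = gt.copy()
pred[rng.random(gt.shape) < 0.05] = 0        # corrupt 5% of voxels
print(dice_per_label(pred, gt, labels=[1, 2, 3]))
```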
(This article belongs to the Special Issue AI and Data Science in Bioengineering: Innovations and Applications)
Figures:
Graphical abstract
Figure 1: Swin UNETR architecture and 3D nnU-Net architecture used in this study.
Figure 2: Example of the median-scored case in mouse multi-organ segmentation in coronal view from the NACT test set. Yellow arrows highlight key differences in segmentation outcomes between the two 3D models, Swin UNETR and nnU-Net.
Figure 3: Example of the median-scored case in mouse multi-organ segmentation in coronal view from the CECT test set. Yellow arrows highlight key differences in segmentation outcomes between the two 3D models, Swin UNETR and nnU-Net.
Figure 4: Example of the median-scored case in mouse multi-organ segmentation in coronal view from the IHCT test set. Yellow arrows highlight key differences in segmentation outcomes between the three models, 3D Swin UNETR, 3D nnU-Net, and 2D AIMOS.
Figure 5: Box plots of DSC (%) and HD95p (mm) per organ for predictions by Swin UNETR (green) vs. nnU-Net (orange). Each box extends from the lower to the upper quartile values of the data, with a black line at the median; the whiskers extend to the outermost data point within 1.5 times the interquartile range.
Figure 6: DSC (%) performance comparisons for each individual mouse in the IHCT test set.
18 pages, 688 KiB  
Article
A Unified Model for Chinese Cyber Threat Intelligence Flat Entity and Nested Entity Recognition
by Jiayi Yu, Yuliang Lu, Yongheng Zhang, Yi Xie, Mingjie Cheng and Guozheng Yang
Electronics 2024, 13(21), 4329; https://doi.org/10.3390/electronics13214329 - 4 Nov 2024
Viewed by 859
Abstract
In recent years, as cybersecurity threats have become increasingly severe and cyberattacks have occurred frequently, higher requirements have been put forward for cybersecurity protection. Therefore, the Named Entity Recognition (NER) technique, which is the cornerstone of Cyber Threat Intelligence (CTI) analysis, is particularly important. However, most existing NER studies are limited to recognizing single-layer flat entities, ignoring the possible nested entities in CTI. On the other hand, most of the existing studies focus on English CTIs, and the existing models performed poorly in a limited number of Chinese CTI studies. Given the above challenges, we propose in this paper a novel unified model, RBTG, which aims to identify flat and nested entities in Chinese CTI effectively. To overcome the difficult boundary recognition problem and the direction-dependent and distance-dependent properties in Chinese CTI NER, we use Global Pointer as the decoder and TENER as the encoder layer, respectively. Specifically, the Global Pointer layer solves the problem of the insensitivity of general NER methods to entity boundaries by utilizing the relative position information and the multiplicative attention mechanism. The TENER layer adapts to the Chinese CTI NER task by introducing an attention mechanism with direction awareness and distance awareness. Meanwhile, to cope with the complex feature capture of hierarchical structure and dependencies among Chinese CTI nested entities, the TENER layer solves the problem by following the structure of multiple self-attention layers and feed-forward network layers superimposed on each other in the Transformer. In addition, to fill the gap in the Chinese CTI nested entity dataset, we further apply the Large Language Modeling (LLM) technique and domain knowledge to construct a high-quality Chinese CTI nested entity dataset, CDTinee, which consists of six entity types selected from STIX, including nearly 4000 entity types extracted from more than 3000 threatening sentences. In the experimental session, we conduct extensive experiments on multiple datasets, and the results show that the proposed model RBTG outperforms the baseline model in both flat NER and nested NER. Full article
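A minimal sketch of the Global-Pointer-style span scoring idea described above: every (start, end) token pair receives a score per entity type via multiplicative attention between start and end projections. The rotary relative-position encoding and the TENER encoder are omitted, and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GlobalPointerHead(nn.Module):
    def __init__(self, hidden=256, head_dim=64, num_types=6):
        super().__init__()
        self.num_types, self.head_dim = num_types, head_dim
        self.proj = nn.Linear(hidden, num_types * head_dim * 2)

    def forward(self, h):                       # h: (B, L, hidden) token encodings
        B, L, _ = h.shape
        qk = self.proj(h).view(B, L, self.num_types, 2, self.head_dim)
        q, k = qk[..., 0, :], qk[..., 1, :]     # start / end representations
        # scores[b, t, i, j]: span (i, j) being an entity of type t
        scores = torch.einsum("bitd,bjtd->btij", q, k) / self.head_dim ** 0.5
        # Mask spans with end < start: only i <= j are valid entity spans,
        # which lets the model read out nested entities as overlapping spans.
        mask = torch.triu(torch.ones(L, L, dtype=torch.bool))
        return scores.masked_fill(~mask, float("-inf"))

scores = GlobalPointerHead()(torch.randn(2, 32, 256))   # (2, 6, 32, 32)
entities = (scores > 0).nonzero()                        # (batch, type, start, end) tuples
```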
(This article belongs to the Special Issue New Challenges in Cyber Security)
Figures:
Figure 1: Nested entity example. “She zhi mi guan you bu gong ji zhe” (setting up honeypots to trap attackers) is an entity of the course-of-action type, while “mi guan” (honeypots) is an entity of the infrastructure type, nested within it. The word “tong guo” (by) is followed by either an attack or a defense action, which has a clear direction dependency, while “wang luo an quan ce lue” (network security strategy) is farther away from the entity of the defense action, which has a clear distance dependency.
Figure 2: Architecture of RBTG. The text of the input layer implies setting up honeypots to trap attackers.
Figure 3: Example of a spanning matrix used to identify entities in "she zhi mi guan you bu gong ji zhe" (setting up honeypots to trap attackers), where 1 indicates the presence of an entity at the location and 0 indicates the absence of an entity.
Figure 4: Labeling method example. For multiple entities in a text, CDTinee uses a JSON array to list the types and indexes of the entities.
Figure 5: A prompt example for the attack-pattern type. Different colors represent different components of the prompt.
18 pages, 1002 KiB  
Article
Remote Sensing Image Change Captioning Using Multi-Attentive Network with Diffusion Model
by Yue Yang, Tie Liu, Ying Pu, Liangchen Liu, Qijun Zhao and Qun Wan
Remote Sens. 2024, 16(21), 4083; https://doi.org/10.3390/rs16214083 - 1 Nov 2024
Cited by 1 | Viewed by 942
Abstract
Remote sensing image change captioning (RSICC) has received considerable research interest because it can automatically provide meaningful sentences describing the changes in remote sensing (RS) images. Existing RSICC methods mainly rely on networks pre-trained on natural image datasets to extract feature representations, which degrades performance since aerial images possess distinctive characteristics compared to natural images. In addition, it is challenging to capture the data distribution and perceive contextual information between samples, resulting in limited robustness and generalization of the feature representations. Furthermore, directly aggregating all features is insufficient to focus on the inherently most change-aware discriminative information. To deal with these problems, a novel framework entitled Multi-Attentive network with Diffusion model for RSICC (MADiffCC) is proposed in this work. Specifically, we introduce a diffusion feature extractor, based on a diffusion model pre-trained on an RS image dataset, to capture the multi-level and multi-time-step feature representations of bitemporal RS images. The diffusion model is able to learn the training data distribution and the contextual information of RS objects, from which more robust and generalized representations can be extracted for the downstream task of change captioning. Furthermore, a time-channel-spatial attention (TCSA) mechanism based difference encoder is designed to exploit the extracted diffusion features and obtain the discriminative information. A gated multi-head cross-attention (GMCA)-guided change captioning decoder is then proposed to select and fuse crucial hierarchical features for more precise change description generation. Experimental results on the publicly available LEVIR-CC, LEVIRCCD, and DUBAI-CC datasets verify that the developed approach achieves state-of-the-art (SOTA) performance.
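A minimal sketch of a gated multi-head cross-attention block of the kind named above (GMCA): caption-side queries attend to change features, and a learned gate decides how much of the attended signal passes on. Dimensions and the gating form are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim * 2, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries, change_feats):
        # queries: (B, Lq, dim) decoder/word states; change_feats: (B, Lk, dim) difference features
        attended, _ = self.attn(queries, change_feats, change_feats)
        g = self.gate(torch.cat([queries, attended], dim=-1))   # per-position gate in (0, 1)
        return self.norm(queries + g * attended)

out = GatedCrossAttention()(torch.randn(2, 20, 256), torch.randn(2, 196, 256))  # (2, 20, 256)
```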
(This article belongs to the Section Remote Sensing Image Processing)
Figures:
Figure 1: Overall framework of the proposed MADiffCC. It includes three main components: a diffusion feature extractor based on an RS image dataset pre-trained DDPM to retrieve semantically related multi-level and multi-time-step feature representations, a TCSA-based difference encoder to effectively obtain the most change-aware difference representations, and a GMCA-guided change captioning decoder to learn critical difference information for generating change descriptions.
Figure 2: Visualization of the TCSA-based difference encoder.
Figure 3: Visualization of the GMCA-guided change captioning decoder.
Figure 4: Qualitative comparison of change captioning results. GT: ground truth caption; (a) Chg2Cap [14]; (b) the proposed MADiffCC model. Words marked in green stand for more precise and detailed predicted change objects by the proposed method, while the red text indicates inaccurate representations. The red boxes in image pairs 1 and 2 indicate the small object undetected by the benchmark method.
17 pages, 5437 KiB  
Article
ChartLine: Automatic Detection and Tracing of Curves in Scientific Line Charts Using Spatial-Sequence Feature Pyramid Network
by Wenjin Yang, Jie He and Qian Li
Sensors 2024, 24(21), 7015; https://doi.org/10.3390/s24217015 - 31 Oct 2024
Viewed by 688
Abstract
Line charts are prevalent in scientific documents and commercial data visualization, serving as essential tools for conveying data trends. Automatic detection and tracing of line paths in these charts is crucial for downstream tasks such as data extraction, chart quality assessment, plagiarism detection, and visual question answering. However, line graphs present unique challenges due to their complex backgrounds and diverse curve styles, including solid, dashed, and dotted lines. Existing curve detection algorithms struggle to address these challenges effectively. In this paper, we propose ChartLine, a novel network designed for detecting and tracing curves in line graphs. Our approach integrates a Spatial-Sequence Attention Feature Pyramid Network (SSA-FPN) in both the encoder and decoder to capture rich hierarchical representations of curve structures and boundary features. The model incorporates a Spatial-Sequence Fusion (SSF) module and a Channel Multi-Head Attention (CMA) module to enhance intra-class consistency and inter-class distinction. We evaluate ChartLine on four line chart datasets and compare its performance against state-of-the-art curve detection, edge detection, and semantic segmentation methods. Extensive experiments demonstrate that our method significantly outperforms existing algorithms, achieving an F-measure of 94% on a synthetic dataset. Full article
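One plausible, minimal sketch of channel-wise multi-head attention in the spirit of the CMA module mentioned above: feature channels are treated as tokens so attention is computed across channels rather than spatial positions. The token layout and sizes are illustrative assumptions, not the paper's module.

```python
import torch
import torch.nn as nn

class ChannelMultiHeadAttention(nn.Module):
    def __init__(self, channels=64, spatial_dim=32 * 32, num_heads=4):
        super().__init__()
        # Each channel becomes one token whose embedding is its flattened spatial map.
        self.attn = nn.MultiheadAttention(spatial_dim, num_heads, batch_first=True)

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        tokens = x.flatten(2)                   # (B, C, H*W): one token per channel
        out, _ = self.attn(tokens, tokens, tokens)
        return (tokens + out).view(B, C, H, W)  # residual, back to feature-map layout

y = ChannelMultiHeadAttention()(torch.randn(2, 64, 32, 32))  # (2, 64, 32, 32)
```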
(This article belongs to the Section Sensor Networks)
Figures:
Figure 1: Two samples with variable curve types and indistinguishable noise and curves. Both images are from the FigureSeer dataset; (a) is a gray image and (b) is a color image.
Figure 2: An illustration of the ChartLine network. The encoder and decoder are used to construct multi-scale features, the attention feature pyramid is used to capture the long dependencies of the curves and adaptively distinguish between the foreground and background, and the decoder's output is a curve probability map. Conv* represents Conv1–5.
Figure 3: The details of the Spatial-Sequence Fusion module.
Figure 4: The details of the Channel Multi-Head Attention module.
Figure 5: Precision–recall curves on the four test datasets: (a) EXC, (b) Chart2019, (c) FS, and (d) PMC. The red dots in the figure are the optimal values for each algorithm.
Figure 6: Curve detection results from seven different methods are shown across five types of line chart images. The top row displays the original source images, while each subsequent row presents the detection results from a different algorithm. Lines detected by each method are marked in green. Red boxes indicate specific errors, providing a zoomed view of areas where the detected lines and the true curves do not align.
Figure 7: P-R curves with different loss functions on four datasets. (a) The result of ChartLine-BCE. (b) The result of ChartLine.
Figure 8: The detection results of different modules: (a) the raw image, (b) the detection result of ChartLine-SSF, (c) the detection result of ChartLine-CMA, and (d) the detection result of ChartLine. The red circles in the image zoom in on local details for easy comparison.
21 pages, 37600 KiB  
Article
A Multi-Hierarchical Complementary Feature Interaction Network for Accelerated Multi-Modal MR Imaging
by Haotian Zhang, Qiaoyu Ma, Yiran Qiu and Zongying Lai
Appl. Sci. 2024, 14(21), 9764; https://doi.org/10.3390/app14219764 - 25 Oct 2024
Viewed by 696
Abstract
Magnetic resonance (MR) imaging is widely used in the clinical field due to its non-invasiveness, but the long scanning time is still a bottleneck for its popularization. Using the complementary information between multi-modal imaging to accelerate imaging provides a novel and effective MR fast imaging solution. However, previous techniques mostly use simple fusion methods and fail to fully utilize the potentially sharable knowledge. In this study, we introduced a novel multi-hierarchical complementary feature interaction network (MHCFIN) to realize joint reconstruction of multi-modal MR images from undersampled data and thus accelerate multi-modal imaging. Firstly, multiple attention mechanisms are integrated with a dual-branch encoder–decoder network to represent shared features and complementary features of different modalities. In the decoding stage, the multi-modal feature interaction module (MMFIM) acts as a bridge between the two branches, realizing complementary knowledge transfer between different modalities through cross-level fusion. The single-modal feature fusion module (SMFFM) carries out multi-scale feature representation and optimization of the single modality, preserving better anatomical details. Extensive experiments are conducted under different sampling patterns and acceleration factors. The results show that the proposed method achieves obvious improvements over existing state-of-the-art reconstruction methods, both visually and quantitatively.
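A minimal sketch of the accelerated-MRI setting described above: a 1D random undersampling mask in k-space at a nominal 8× acceleration, and the zero-filled reconstruction that such networks take as input. Sizes and the rule of always keeping a few central low-frequency lines are illustrative assumptions.

```python
import numpy as np

def random_1d_mask(ny, accel=8, center_lines=12, seed=0):
    rng = np.random.default_rng(seed)
    mask = rng.random(ny) < 1.0 / accel          # randomly kept phase-encoding lines
    c = ny // 2
    mask[c - center_lines // 2: c + center_lines // 2] = True   # keep the k-space center
    return mask

image = np.random.rand(256, 256)                  # stand-in for a fully sampled slice
kspace = np.fft.fftshift(np.fft.fft2(image))
mask = random_1d_mask(image.shape[1])
undersampled = kspace * mask[None, :]             # zero out unacquired columns
zero_filled = np.abs(np.fft.ifft2(np.fft.ifftshift(undersampled)))
print("sampled fraction:", mask.mean())           # roughly 1/8 plus the center lines
```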
Figures:
Figure 1: Illustration of different MRI modalities and the reconstruction comparison results on fastMRI brain datasets. (a) Fully-sampled T1WI and T2WI. (b) Zero-filled T1WI and T2WI with 1D random sampling at an 8× acceleration factor. (c) Reconstruction results of DuDoRNet [14]. (d) Reconstruction results of STUN [15]. (e) Reconstruction results of MHCFIN (ours). Green boxes highlight detailed structures.
Figure 2: Overall architecture of the proposed multi-hierarchical complementary feature interaction network (MHCFIN).
Figure 3: Detailed architecture of the proposed encoder–decoder and detailed construction of triple attention.
Figure 4: Detailed structure of the three types of attention (channel, spatial, and gate attention).
Figure 5: The proposed multi-modal feature interaction module consists of a double cross-attention, used to interact multi-scale features of different modalities at different hierarchical levels.
Figure 6: Detailed architecture of the single-modal feature fusion module.
Figure 7: Qualitative comparison with different reconstruction methods using a 1D random undersampling pattern with acceleration factor 8× on fastMRI brain datasets. Ground truth (GT), zero-filled (ZF), reconstructed MR images (T1WI and T2WI), error maps, and zoomed-in details are provided.
Figure 8: Qualitative comparison with different reconstruction methods using a 1D equispaced undersampling pattern with acceleration factor 12× on fastMRI brain datasets.
Figure 9: Qualitative comparison with different reconstruction methods using a 1D random undersampling pattern with acceleration factor 10× on fastMRI knee datasets.
Figure 10: Bar charts for (a) PSNR, (b) SSIM, and (c) RLNE depicting the performance of various reconstruction methods using an 8× acceleration factor mask on the fastMRI brain dataset. The black arrows represent the standard deviation (mean ± standard deviation) for each method.
Figure 11: The training loss of different reconstruction methods on the fastMRI brain dataset.
Figure 12: Ablation study of the key components in the proposed method using a 1D random undersampling pattern with acceleration factor 8× on fastMRI brain datasets.
16 pages, 7008 KiB  
Article
Improving Top-Down Attention Network in Speech Separation by Employing Hand-Crafted Filterbank and Parameter-Sharing Transformer
by Aye Nyein Aung and Jeih-weih Hung
Electronics 2024, 13(21), 4174; https://doi.org/10.3390/electronics13214174 - 24 Oct 2024
Viewed by 715
Abstract
The “cocktail party problem”, the challenge of isolating individual speech signals from a noisy mixture, has traditionally been addressed using statistical methods. However, deep neural networks (DNNs), with their ability to learn complex patterns, have emerged as superior solutions. DNNs excel at capturing intricate relationships between mixed audio signals and their respective speech sources, enabling them to effectively separate overlapping speech signals in challenging acoustic environments. Recent advances in speech separation systems have drawn inspiration from the brain’s hierarchical sensory information processing, incorporating top-down attention mechanisms. The top-down attention network (TDANet) employs an encoder–decoder architecture with top-down attention to enhance feature modulation and separation performance. By leveraging attention signals from multi-scale input features, TDANet effectively modifies features across different scales using a global attention (GA) module in the encoder–decoder design. Local attention (LA) layers then convert these modulated signals into high-resolution auditory characteristics. In this study, we propose two key modifications to TDANet. First, we substitute the fully trainable convolutional encoder with a deterministic hand-crafted multi-phase gammatone filterbank (MP-GTF), which mimics human hearing. Experimental results demonstrated that this substitution yielded comparable or even slightly superior performance to the original TDANet with a trainable encoder. Second, we replace the single multi-head self-attention (MHSA) layer in the global attention module with a transformer encoder block consisting of multiple MHSA layers. To optimize GPU memory utilization, we introduce a parameter sharing mechanism, dubbed “Reverse Cycle”, across layers in the transformer-based encoder. Our experimental findings indicated that these proposed modifications enabled TDANet to achieve competitive separation performance, rivaling state-of-the-art techniques, while maintaining superior computational efficiency. Full article
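A minimal sketch of a parameter-sharing transformer encoder in the spirit of the "Reverse Cycle" idea above: a small pool of encoder layers is reused so the layer order runs forward and then back again (e.g. 1-2-3-3-2-1), roughly halving the parameter count. This is one plausible reading of the scheme, not the authors' implementation; all sizes are illustrative.

```python
import torch
import torch.nn as nn

class ReverseCycleEncoder(nn.Module):
    def __init__(self, dim=256, num_heads=8, unique_layers=3):
        super().__init__()
        # Only `unique_layers` parameter sets exist; they are applied twice per pass.
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, num_heads, dim_feedforward=4 * dim,
                                       batch_first=True)
            for _ in range(unique_layers))

    def forward(self, x):                       # x: (B, L, dim)
        order = list(range(len(self.layers))) + list(reversed(range(len(self.layers))))
        for i in order:                         # 0,1,2,2,1,0 -> six applications, three parameter sets
            x = self.layers[i](x)
        return x

out = ReverseCycleEncoder()(torch.randn(2, 100, 256))   # (2, 100, 256)
```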
(This article belongs to the Special Issue Natural Language Processing Method: Deep Learning and Deep Semantics)
Figures:
Figure 1: Schematic representation of a speech separation framework. The system processes an input mixture signal to produce two separated source signals for a two-speaker scenario.
Figure 2: Overall architecture of the separation network within TDANet (redrawn from [31]).
Figure 3: Temporal and spectral domain representations of all filters in the MP-GTF, with the number of filters N set to 128.
Figure 4: TDANet with MP-GTF as audio encoder.
Figure 5: Comparison between (a) a single transformer block and (b) a six-layer parameter-sharing transformer block (transformer encoder layers with the same color denote parameter-sharing layers).
Figure 6: Comparison of spectrograms for ground truth sources and estimated sources produced by the baseline TDANet model and the proposed methods A and B. (a) Waveform and spectrogram of the audio mixture. (b) Ground truth spectrograms of speaker source 1 and speaker source 2. (c) Estimated spectrograms of speaker source 1 and speaker source 2 using the baseline TDANet model. (d) Estimated spectrograms of speaker source 1 and speaker source 2 using proposed method A. (e) Estimated spectrograms of speaker source 1 and speaker source 2 using proposed method B. The areas outlined by the red box are higher frequency bands.
14 pages, 1281 KiB  
Article
A Flexible Hierarchical Framework for Implicit 3D Characterization of Bionic Devices
by Yunhong Lu, Xiangnan Li and Mingliang Li
Biomimetics 2024, 9(10), 590; https://doi.org/10.3390/biomimetics9100590 - 29 Sep 2024
Viewed by 694
Abstract
In practical applications, integrating three-dimensional models of bionic devices with simulation systems can predict their behavior and performance under various operating conditions, providing a basis for subsequent engineering optimization and improvements. This study proposes a framework for characterizing three-dimensional models of objects, focusing on extracting 3D structures and generating high-quality 3D models. The core concept involves obtaining the density output of the model from multiple images to enable adaptive boundary surface detection. The framework employs a hierarchical octree structure to partition the 3D space based on surface and geometric complexity. This approach includes recursive encoding and decoding of the octree structure and surface geometry, ultimately leading to the reconstruction of the 3D model. The framework has been validated through a series of experiments, yielding positive results. Full article
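A minimal sketch of recursive bottom-up encoding over an octree, in the spirit of the hierarchical encoder described above: each node aggregates its children's codes with a shared MLP and an order-invariant pooling. The node layout and feature sizes are illustrative assumptions, not the paper's network.

```python
import torch
import torch.nn as nn

class OctreeNode:
    def __init__(self, geometry_code, children=()):
        self.geometry_code = geometry_code      # (feat_dim,) local geometry feature
        self.children = list(children)          # up to 8 child nodes

class RecursiveOctreeEncoder(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        self.child_mlp = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
        self.merge_mlp = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())

    def encode(self, node):
        if not node.children:                   # leaf: use its own geometry code
            return node.geometry_code
        # Order-invariant aggregation of children (max over the child codes).
        child_codes = torch.stack([self.child_mlp(self.encode(c)) for c in node.children])
        pooled = child_codes.max(dim=0).values
        return self.merge_mlp(torch.cat([node.geometry_code, pooled]))

leaf = lambda: OctreeNode(torch.randn(64))
root = OctreeNode(torch.randn(64),
                  children=[OctreeNode(torch.randn(64), [leaf(), leaf()]), leaf()])
latent = RecursiveOctreeEncoder().encode(root)   # (64,) code for the whole shape
```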
(This article belongs to the Special Issue Biomimetic Aspects of Human–Computer Interactions)
Figures:
Figure 1: A diagram of a hierarchical octree neural network in two dimensions. In this paper, a recursive encoder–decoder network is proposed, which is trained using several GAN methods. Here, the geometry of the octree is encoded using the voxel 3DCNN and recursively aggregated using the hierarchical structure and geometric features of the local encoder ε_i. The decoding function is implemented by a local decoder D_i hierarchy with a mirror structure relative to the encoder. The structural and geometric information of the input model is decoded recursively, and the local geometric surfaces are recovered with input of an implicit decoder embedded in each octree.
Figure 2: The structure of the encoder E_k and the decoder D_k. E_k collects the structure (a_{c_j}, b_{c_j}) and geometric characteristics (g_{c_j}) of the child octrees into its parent octree k, where c_j is in C_k, utilizing an MLP, a maximum set operation, and a second MLP. Two MLPs and classifiers decode the geometric features g_k of the parent space into geometric features g_{c_j} and two attributes α_{c_j}, β_{c_j} of the child space. Two metrics are employed to determine the probability of surface occupation and the need for substructure subdivision.
Figure 3: The depth generation model, which is composed of n combined network layers of linear network and implicit network; the implicit network core uses the sin function as the calculation method. The multi-layer perceptron takes the position information x and the noise information z processed by the mapping network layer as the input, and the final output is the density. (a) Overall architecture of the network model. (b) Specific structure of the FiLM SIREN unit.
Figure 4: Shape reconstruction comparison of (a) LIG [5], (b) OccNet [9], (c) IM-Net [4], (d) OctField [6], and (e) the work of this paper.
Figure 5: The result of modeling the detail part of the aircraft model.
Figure 6: Shape generation. The image shows the results generated by randomly sampling potential codes in the potential space.
Figure 7: Shape interpolation. The figure shows the results of two types of interpolation: table and chair. (a) The source shape and (f) the target shape. (b–e) are the intermediate results of interpolation.
23 pages, 25042 KiB  
Article
Segmentation Network for Multi-Shape Tea Bud Leaves Based on Attention and Path Feature Aggregation
by Tianci Chen, Haoxin Li, Jinhong Lv, Jiazheng Chen and Weibin Wu
Agriculture 2024, 14(8), 1388; https://doi.org/10.3390/agriculture14081388 - 17 Aug 2024
Viewed by 669
Abstract
Accurately detecting tea bud leaves is crucial for the automation of tea picking robots. However, challenges arise due to tea stem occlusion and overlapping of buds and leaves, presenting varied shapes of one bud–one leaf targets in the field of view, making precise segmentation of tea bud leaves challenging. To improve the segmentation accuracy of one bud–one leaf targets with different shapes and fine granularity, this study proposes a novel semantic segmentation model for tea bud leaves. The method designs a hierarchical Transformer block based on a self-attention mechanism in the encoding network, which is beneficial for capturing long-range dependencies between features and enhancing the representation of common features. Then, a multi-path feature aggregation module is designed to effectively merge the feature outputs of encoder blocks with decoder outputs, thereby alleviating the loss of fine-grained features caused by downsampling. Furthermore, a refined polarized attention mechanism is employed after the aggregation module to perform polarized filtering on features in channel and spatial dimensions, enhancing the output of fine-grained features. The experimental results demonstrate that the proposed Unet-Enhanced model achieves segmentation performance well on one bud–one leaf targets with different shapes, with a mean intersection over union (mIoU) of 91.18% and a mean pixel accuracy (mPA) of 95.10%. The semantic segmentation network can accurately segment tea bud leaves, providing a decision-making basis for the spatial positioning of tea picking robots. Full article
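A minimal sketch of the mean intersection-over-union (mIoU) metric reported above, computed from a confusion matrix over all pixels; the class count and toy data are placeholders.

```python
import numpy as np

def miou(pred, gt, num_classes):
    # Confusion matrix: rows are ground-truth classes, columns are predicted classes.
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (gt.ravel(), pred.ravel()), 1)
    inter = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    ious = inter / np.maximum(union, 1)
    return ious.mean(), ious

rng = np.random.default_rng(0)
gt = rng.integers(0, 3, size=(128, 128))
pred = np.where(rng.random(gt.shape) < 0.9, gt, rng.integers(0, 3, size=gt.shape))
mean_iou, per_class = miou(pred, gt, num_classes=3)
print(f"mIoU = {mean_iou:.3f}", per_class)
```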
(This article belongs to the Section Digital Agriculture)
Figures:
Figure 1: Tea garden environment and label schematics.
Figure 2: Segmentation network architecture. Note: Conv1 × 1: convolution operation with kernel 1 × 1; Conv3 × 3: convolution operation with kernel 3 × 3; BN: BatchNormalization; Upsampling2D: feature upsampling; ReLU: activation function; Linear: linear transformation; Interpolate: bilinear interpolation.
Figure 3: Computation of the Transformer block.
Figure 4: Path feature aggregation module.
Figure 5: Polarized attention mechanism module.
Figure 6: Results of training and testing: (a) loss curve for the training dataset; (b) statistics of the mIoU for the testing dataset.
Figure 7: Comparison of segmentation performance of different models.
Figure 8: Comparison of segmentation results for large-size tea bud leaves.
Figure 9: Comparison of segmentation results for multi-target and fine-grained tea bud leaves.
Figure 10: Comparison of segmentation results of different network models. (A) Original image. (B) Ground truth. (C) DeepLabv3+. (D) PSPNet. (E) HRNet. (F) SegFormer. (G) Unet-Enhanced.
Figure 11: Segmentation effect of tea bud leaves with different shapes and fine-grained features: (a) mainly "tea_I", (b) mainly "tea_V", (c) mainly "tea_Y".
Figure 12: Failure cases of Unet-Enhanced.
Figure 13: Shallow feature visualization.
Figure 14: Deep feature visualization.
Figure 15: Unet heat map. (a–f) are the test results of different samples.
Figure 16: Unet-Enhanced heat map. (a–f) are the test results of different samples.
27 pages, 59331 KiB  
Article
AerialFormer: Multi-Resolution Transformer for Aerial Image Segmentation
by Taisei Hanyu, Kashu Yamazaki, Minh Tran, Roy A. McCann, Haitao Liao, Chase Rainwater, Meredith Adkins, Jackson Cothren and Ngan Le
Remote Sens. 2024, 16(16), 2930; https://doi.org/10.3390/rs16162930 - 9 Aug 2024
Cited by 4 | Viewed by 2314
Abstract
When performing remote sensing image segmentation, practitioners often encounter various challenges, such as a strong imbalance in the foreground–background, the presence of tiny objects, high object density, intra-class heterogeneity, and inter-class homogeneity. To overcome these challenges, this paper introduces AerialFormer, a hybrid model that strategically combines the strengths of Transformers and Convolutional Neural Networks (CNNs). AerialFormer features a CNN Stem module integrated to preserve low-level and high-resolution features, enhancing the model’s capability to process details of aerial imagery. The proposed AerialFormer is designed with a hierarchical structure, in which a Transformer encoder generates multi-scale features and a multi-dilated CNN (MDC) decoder aggregates the information from the multi-scale inputs. As a result, information is taken into account in both local and global contexts, so that powerful representations and high-resolution segmentation can be achieved. The proposed AerialFormer was benchmarked on three benchmark datasets, including iSAID, LoveDA, and Potsdam. Comprehensive experiments and extensive ablation studies show that the proposed AerialFormer remarkably outperforms state-of-the-art methods. Full article
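A minimal sketch of a multi-dilated convolution block of the kind used in the MDC decoder described above: parallel 3×3 convolutions with different dilation rates see small and large contexts at once, and their outputs are fused back to the input width. Channel sizes and dilation rates are illustrative assumptions, not the paper's MDC block.

```python
import torch
import torch.nn as nn

class MultiDilatedBlock(nn.Module):
    def __init__(self, channels=256, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels // len(dilations), 3, padding=d, dilation=d)
            for d in dilations)
        self.fuse = nn.Conv2d((channels // len(dilations)) * len(dilations), channels, 1)
        self.act = nn.GELU()

    def forward(self, x):
        # Each branch keeps the spatial size (padding == dilation for a 3x3 kernel).
        y = torch.cat([b(x) for b in self.branches], dim=1)
        return x + self.act(self.fuse(y))       # residual keeps the decoder stable

out = MultiDilatedBlock()(torch.randn(1, 256, 64, 64))   # (1, 256, 64, 64)
```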
(This article belongs to the Special Issue Deep Learning and Computer Vision in Remote Sensing-III)
Figures:
Graphical abstract
Figure 1: Examples of challenging characteristics in remote sensing image segmentation. (Left) (i) The distribution of the foreground and background is highly imbalanced (black). (Top right) Objects in some classes are (ii) tiny (yellow) and (iii) dense (orange), so that they are hardly identifiable. (Bottom right) Within a class, there is a large diversity in appearance: (iv) intra-class heterogeneity (purple); some different classes share a similar appearance: (v) inter-class homogeneity (pink). The image is from the iSAID dataset, best viewed in color.
Figure 2: Overall network architecture of the proposed AerialFormer, which consists of three components, i.e., CNN Stem, Transformer Encoder, and multi-dilated CNN decoder.
Figure 3: Illustrations of the CNN Stem. The Stem takes the input image and produces feature maps with half of the original spatial resolution.
Figure 4: Illustration of a Transformer Encoder Block, showcasing how it employs localized attention within shifting windows to progressively capture the global context as the network depth increases.
Figure 5: An illustration of the MDC Block, which consists of Pre-Channel Mixer, DCL, and Post-Channel Mixer.
Figure 6: Qualitative ablation study on the CNN Stem and multi-dilated CNN decoder on the iSAID dataset. Comparing Swin-Unet (baseline) and MDC-Only with Stem-Only and AerialFormer makes clear that the Stem module helps to segment the small and dense objects, highlighted by the red line.
Figure 7: Qualitative comparison between AerialFormer, PSPNet [91], and DeepLabV3+ [24] in terms of foreground–background imbalance. From left to right are the original image, ground truth, PSPNet, DeepLabV3+, and AerialFormer. The first row shows the overall performances, and the second row shows zoomed-in regions; the corresponding regions in the first row are highlighted with a red frame and connected to the zoomed-in regions with red lines.
Figure 8: Qualitative comparison between AerialFormer, PSPNet [91], and DeepLabV3+ [24] in terms of tiny objects; layout as in Figure 7. Some of the objects that are evident in the input are ignored in the ground truth label.
Figure 9: Qualitative comparison between AerialFormer, PSPNet [91], and DeepLabV3+ [24] in terms of dense objects; layout as in Figure 7.
Figure 10: Qualitative comparison between AerialFormer, PSPNet [91], and DeepLabV3+ [24] in terms of intra-class heterogeneity: the regions highlighted in the box are both classified under the 'Agriculture' category, but one region features green lands while the other depicts greenhouses; layout as in Figure 7.
Figure 11: Qualitative comparison between AerialFormer, PSPNet [91], and DeepLabV3+ [24] in terms of inter-class homogeneity: the regions highlighted in the box share similar visual characteristics, but one region is classified as a 'Building' while the other belongs to the 'Low Vegetation' category; layout as in Figure 7.
Figure 12: Qualitative comparison on various datasets: (a) iSAID, (b) Potsdam, and (c) LoveDA. From left to right: original image, ground truth, PSPNet, DeepLabV3+, and the proposed AerialFormer. The major differences are highlighted in red boxes.
15 pages, 1112 KiB  
Article
ALKU-Net: Adaptive Large Kernel Attention Convolution Network for Lung Nodule Segmentation
by Juepu Chen, Shuxian Liu and Yulong Liu
Electronics 2024, 13(16), 3121; https://doi.org/10.3390/electronics13163121 - 7 Aug 2024
Viewed by 1467
Abstract
The accurate segmentation of lung nodules in computed tomography (CT) images is crucial for the early screening and diagnosis of lung cancer. However, the heterogeneity of lung nodules and their similarity to other lung tissue features make this task more challenging. By using large receptive fields from large convolutional kernels, convolutional neural networks (CNNs) can achieve higher segmentation accuracies with fewer parameters. However, due to the fixed size of the convolutional kernel, CNNs still struggle to extract multi-scale features for lung nodules of varying sizes. In this study, we propose a novel network to improve the segmentation accuracy of lung nodules. The network integrates adaptive large kernel attention (ALK) blocks, employing multiple convolutional layers with variously sized convolutional kernels and expansion rates to extract multi-scale features. A dynamic selection mechanism is also introduced to aggregate the multi-scale features obtained from variously sized convolutional kernels based on selection weights. Based on this, we propose a lightweight convolutional neural network with large convolutional kernels, called ALKU-Net, which integrates the ALKA module in a hierarchical encoder and adopts a U-shaped decoder to form a novel architecture. ALKU-Net efficiently utilizes the multi-scale large receptive field and enhances the model perception capability through spatial attention and channel attention. Extensive experiments demonstrate that our method outperforms other state-of-the-art models on the public dataset LUNA-16, exhibiting considerable accuracy in the lung nodule segmentation task. Full article
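A minimal sketch of the dynamic multi-kernel selection idea described above: two depthwise convolutions with different kernel sizes produce parallel features, and softmax selection weights derived from globally pooled statistics decide how much each branch contributes per channel. The 3D tensors mirror the CT setting; all sizes are illustrative assumptions, not the ALKA block itself.

```python
import torch
import torch.nn as nn

class AdaptiveKernelSelect(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.small = nn.Conv3d(channels, channels, 3, padding=1, groups=channels)  # 3x3x3 DWConv
        self.large = nn.Conv3d(channels, channels, 5, padding=2, groups=channels)  # 5x5x5 DWConv
        self.select = nn.Linear(channels, channels * 2)   # selection logits for the two branches

    def forward(self, x):                                  # x: (B, C, D, H, W)
        x1, x2 = self.small(x), self.large(x)
        pooled = (x1 + x2).mean(dim=(2, 3, 4))             # (B, C) global context
        w = self.select(pooled).view(x.size(0), 2, -1).softmax(dim=1)
        w1, w2 = w[:, 0], w[:, 1]                          # per-channel selection weights, sum to 1
        return x1 * w1[:, :, None, None, None] + x2 * w2[:, :, None, None, None]

out = AdaptiveKernelSelect()(torch.randn(1, 32, 16, 32, 32))  # (1, 32, 16, 32, 32)
```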
(This article belongs to the Special Issue Deep Learning in Image Processing and Segmentation)
Figures:
Figure 1: The architecture of the ALKA. Feature maps X1 and X2 are extracted by 3 × 3 × 3 DWConv and 5 × 5 × 5 DWConv from the input features X, respectively. The dynamic selection values W1 and W2 are generated to calibrate features X1 and X2.
Figure 2: The architecture of the ALKA block. The symbol * denotes the multiplication operation.
Figure 3: Overview of the proposed ALKU-Net.
Figure 4: Examples of the preprocessed images in the LUNA-16 dataset. (a–c) show samples of the original CT image, the preprocessed lung region, and the mask image, respectively.
Figure 5: Loss curves and Dice scores of ALKU-Net during training on the LUNA-16 dataset.
Figure 6: Five-fold cross-validation results on the LUNA-16 dataset, comparing models based on the Dice Similarity Coefficient (DSC).
Figure 7: Two-dimensional (2D) visual comparison of different lung nodule segmentations. From left to right are the original image (a), ground truths (b), and the segmentation results of 3D U-Net (c), STU-Net (d), 3D UX-Net (e), nnFormer (f), UNETR (g), SwinUNETR (h), and our ALKU-Net (i).
Figure 8: Three-dimensional (3D) visual comparison of different lung nodule segmentations. From left to right are ground truths (a), and the segmentation results of 3D U-Net (b), STU-Net (c), 3D UX-Net (d), nnFormer (e), UNETR (f), SwinUNETR (g), and our ALKU-Net (h).
16 pages, 483 KiB  
Article
Query-Based Object Visual Tracking with Parallel Sequence Generation
by Chang Liu, Bin Zhang, Chunjuan Bo and Dong Wang
Sensors 2024, 24(15), 4802; https://doi.org/10.3390/s24154802 - 24 Jul 2024
Viewed by 776
Abstract
Query decoders have been shown to achieve good performance in object detection. However, they suffer from insufficient object tracking performance. Sequence-to-sequence learning in this context has recently been explored, with the idea of describing a target as a sequence of discrete tokens. In this study, we experimentally determine that, with appropriate representation, a parallel approach for predicting a target coordinate sequence with a query decoder can achieve good performance and speed. We propose a concise query-based tracking framework for predicting a target coordinate sequence in a parallel manner, named QPSTrack. A set of queries are designed to be responsible for different coordinates of the tracked target. All the queries jointly represent a target rather than a traditional one-to-one matching pattern between the query and target. Moreover, we adopt an adaptive decoding scheme including a one-layer adaptive decoder and learnable adaptive inputs for the decoder. This decoding scheme assists the queries in decoding the template-guided search features better. Furthermore, we explore the use of the plain ViT-Base, ViT-Large, and lightweight hierarchical LeViT architectures as the encoder backbone, providing a family of three variants in total. All the trackers are found to obtain a good trade-off between speed and performance; for instance, our tracker QPSTrack-B256 with the ViT-Base encoder achieves a 69.1% AUC on the LaSOT benchmark at 104.8 FPS. Full article
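A minimal sketch of the parallel query idea described above: four learnable query tokens (one per box coordinate) attend to template-guided search features in a single decoder layer, and a linear head maps each query to a discrete coordinate token. The vocabulary size, dimensions, and single-layer decoder are illustrative assumptions rather than the QPSTrack configuration.

```python
import torch
import torch.nn as nn

class ParallelCoordinateHead(nn.Module):
    def __init__(self, dim=256, num_bins=1000, num_coords=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_coords, dim))   # x, y, w, h queries
        self.decoder = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.head = nn.Linear(dim, num_bins)                        # classify each coordinate bin

    def forward(self, search_feats):              # search_feats: (B, N_tokens, dim)
        q = self.queries.unsqueeze(0).expand(search_feats.size(0), -1, -1)
        decoded = self.decoder(q, search_feats)   # all four coordinates decoded in parallel
        logits = self.head(decoded)               # (B, 4, num_bins)
        return logits.argmax(-1)                  # discrete coordinate tokens

coords = ParallelCoordinateHead()(torch.randn(2, 16 * 16, 256))   # (2, 4)
```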
Figures:
Figure 1: Comparison of the proposed tracking framework with other representative trackers. Most trackers follow the tracking pipeline in (a). Our tracking framework, shown in (b), is a query-based tracking pipeline that adopts a parallel and adaptive decoder. The target is represented by a sequence of queries; each query is responsible for a coordinate, and all queries are predicted in parallel.
Figure 2: Overall architecture of the proposed tracker. The core component is the encoder–decoder transformer. Four additional queries, which represent four tokens of the target coordinate sequence, are fed into the encoder with the template–search pair. Then, the output queries are sent to the decoder as adaptive inputs. Finally, the adaptive decoder decodes the visual features to the queries and the prediction MLP predicts the target's coordinate sequence.
Figure 3: (a) Details of each encoder layer. (b) Details of the adaptive decoder. The generated parameters are dependent on each adaptive query for spatial and channel mixing.
Figure 4: AUCs for different attributes on LaSOT [25]. Our tracker can be seen to be competitive in multiple attributes, especially with respect to the 'Fast Motion' attribute.
Figure 5: Visualization of the attention weights of the search region corresponding to each query token. The target in the search region is in a green bounding box.
Figure 6: Impact of C_out in spatial mixing on LaSOT [25]. AUC and normalized precision are reported separately.
Figure 7: Speed and performance comparison on LaSOT [25]. Our QPSTrack-Light achieved 140.6 fps while exceeding HCAT [45] by 2.0% AUC.
Figure 8: AUCs of different attributes on LaSOT [25] compared with other lightweight trackers. Our tracker showed significant advantages in the 'Viewpoint Change', 'Out-of-View', and 'Fast Motion' attributes.
20 pages, 27344 KiB  
Article
DeMambaNet: Deformable Convolution and Mamba Integration Network for High-Precision Segmentation of Ambiguously Defined Dental Radicular Boundaries
by Binfeng Zou, Xingru Huang, Yitao Jiang, Kai Jin and Yaoqi Sun
Sensors 2024, 24(14), 4748; https://doi.org/10.3390/s24144748 - 22 Jul 2024
Cited by 3 | Viewed by 1519
Abstract
The incorporation of automatic segmentation methodologies into dental X-ray images refined the paradigms of clinical diagnostics and therapeutic planning by facilitating meticulous, pixel-level articulation of both dental structures and proximate tissues. This underpins the pillars of early pathological detection and meticulous disease progression monitoring. Nonetheless, conventional segmentation frameworks often encounter significant setbacks attributable to the intrinsic limitations of X-ray imaging, including compromised image fidelity, obscured delineation of structural boundaries, and the intricate anatomical structures of dental constituents such as pulp, enamel, and dentin. To surmount these impediments, we propose the Deformable Convolution and Mamba Integration Network, an innovative 2D dental X-ray image segmentation architecture, which amalgamates a Coalescent Structural Deformable Encoder, a Cognitively-Optimized Semantic Enhance Module, and a Hierarchical Convergence Decoder. Collectively, these components bolster the management of multi-scale global features, fortify the stability of feature representation, and refine the amalgamation of feature vectors. A comparative assessment against 14 baselines underscores its efficacy, registering a 0.95% enhancement in the Dice Coefficient and a diminution of the 95th percentile Hausdorff Distance to 7.494. Full article
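A minimal sketch of deformable convolution, the building block named in the encoder above, using torchvision's DeformConv2d: a small convolution predicts per-pixel sampling offsets so the 3×3 kernel can follow curved anatomical boundaries. This illustrates the operator only, not the DeMambaNet encoder; channel sizes are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    def __init__(self, in_ch=16, out_ch=32, k=3):
        super().__init__()
        # 2 offsets (dx, dy) per kernel tap, predicted from the input itself.
        self.offset = nn.Conv2d(in_ch, 2 * k * k, k, padding=k // 2)
        self.deform = DeformConv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, x):
        return self.deform(x, self.offset(x))

out = DeformableBlock()(torch.randn(1, 16, 64, 64))   # (1, 32, 64, 64)
```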
(This article belongs to the Special Issue Biomedical Imaging, Sensing and Signal Processing)
Figure 1: Schematic representation of the Deformable Convolution and Mamba Integration Network (DeMambaNet), integrating a Coalescent Structural Deformable Encoder, a Cognitively-Optimized Semantic Enhance Module, and a Hierarchical Convergence Decoder.
Figure 2: Schematic representation of the CSDE, which integrates a State Space Pathway, based on SSM, in the upper section and an Adaptive Deformable Pathway, based on DCN, in the lower section.
Figure 3: Schematic depiction of each hierarchical stage, composed of DCNv3, LN, and MLP, with DCNv3 as the core operator for efficient feature extraction.
Figure 4: Schematic depiction of the TSMamba block, comprising GSC, ToM, LN, and MLP, which collectively enhance input feature processing and representation.
Figure 5: Schematic depiction of the SEM, which concatenates the encoder outputs, applies Conv, BN, and ReLU, and then enhances the features with the MLP and LVC. The MLP captures global dependencies, while the LVC focuses on local details.
Figure 6: Schematic of the HCD, which incorporates a multi-layered decoder structure. Each tier combines convolutional and deconvolutional layers for feature enhancement and upsampling, and is equipped with the TAFI designed specifically for feature fusion.
Figure 7: Schematic representation of the TAFI, which combines features from the encoder's two pathways and uses local and global attention modules to emphasize important information.
Figure 8: Box plot of the evaluation metrics from the training results. On the x-axis, the models are labeled as follows: (a) ENet; (b) ICNet; (c) LEDNet; (d) OCNet; (e) PSPNet; (f) SegNet; (g) VM-UNet; (h) Attention U-Net; (i) R2U-Net; (j) UNet; (k) UNet++; (l) TransUNet; (m) Dense-UNet; (n) Mamba-UNet; (o) DeMambaNet (ours).
Figure 9: Segmentation results comparing our proposed method with existing state-of-the-art models. The segmentation result for the teeth is shown in green; the red dashed line represents the ground truth.
Figure 10: Box plot of the evaluation metrics from the ablation experiments. On the x-axis, the models are labeled as follows: (a) w/o SSP; (b) w/o ADP; (c) w/o TAFI; (d) w/o SEM; (e) DeMambaNet (ours).
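The Figure 7 caption describes the TAFI as fusing the two encoder pathways with local and global attention. Purely as an illustration of that general pattern, and not as the paper's actual module, the PyTorch sketch below fuses two feature maps with a channel-attention gate (local) and a spatial-attention gate (global); the class name, gate designs, and shapes are our own assumptions.

```python
import torch
import torch.nn as nn

class TwoPathwayAttentionFusion(nn.Module):
    """Illustrative fusion of two encoder pathways with a local (channel)
    and a global (spatial) attention gate. A generic sketch of the pattern
    in the Figure 7 caption, not DeMambaNet's actual TAFI."""
    def __init__(self, channels: int):
        super().__init__()
        # Local branch: squeeze-and-excitation style channel attention.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid(),
        )
        # Global branch: single-channel spatial attention map.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        fused = feat_a + feat_b                      # merge the two pathways
        fused = fused * self.channel_gate(fused)     # re-weight channels
        fused = fused * self.spatial_gate(fused)     # re-weight spatial locations
        return self.proj(fused)

# Toy usage with hypothetical shapes:
x1, x2 = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
print(TwoPathwayAttentionFusion(64)(x1, x2).shape)   # torch.Size([2, 64, 32, 32])
```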
27 pages, 2909 KiB  
Article
Extracting Geoscientific Dataset Names from the Literature Based on the Hierarchical Temporal Memory Model
by Kai Wu, Zugang Chen, Xinqian Wu, Guoqing Li, Jing Li, Shaohua Wang, Haodong Wang and Hang Feng
ISPRS Int. J. Geo-Inf. 2024, 13(7), 260; https://doi.org/10.3390/ijgi13070260 - 21 Jul 2024
Viewed by 1295
Abstract
Extracting geoscientific dataset names from the literature is crucial for building a literature–data association network, which can help readers access the data quickly through the Internet. However, the existing named-entity extraction methods have low accuracy in extracting geoscientific dataset names from unstructured text because geoscientific dataset names are a complex combination of multiple elements, such as geospatial coverage, temporal coverage, scale or resolution, theme content, and version. This paper proposes a new method based on the hierarchical temporal memory (HTM) model, a brain-inspired neural network with superior performance in high-level cognitive tasks, to accurately extract geoscientific dataset names from unstructured text. First, a word-encoding method based on the Unicode values of characters for the HTM model was proposed. Then, over 12,000 dataset names were collected from geoscience data-sharing websites and encoded into binary vectors to train the HTM model. We conceived a new classifier scheme for the HTM model that decodes the predictive vector for the encoder of the next word so that the similarity of the encoders of the predictive next word and the real next word can be computed. If the similarity is greater than a specified threshold, the real next word can be regarded as part of the name, and a successive word set forms the full geoscientific dataset name. We used the trained HTM model to extract geoscientific dataset names from 100 papers. Our method achieved an F1-score of 0.727, outperforming the GPT-4- and Claude-3-based few-shot learning (FSL) method, with F1-scores of 0.698 and 0.72, respectively.
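The abstract describes encoding each word as a binary vector derived from the Unicode values of its characters, then accepting a predicted next word when the similarity between the predicted and the actual encodings exceeds a threshold. The sketch below is one plausible reading of that scheme; the vector width, per-character bit allocation, overlap-based similarity, and the 0.6 threshold are assumptions of ours, since the paper's exact encoder is not reproduced in this listing.

```python
import numpy as np

VEC_BITS = 400        # assumed total width of the binary vector
CHAR_BITS = 20        # assumed bits allocated per character position
MAX_CHARS = VEC_BITS // CHAR_BITS

def encode_word(word: str) -> np.ndarray:
    """Map a word to a sparse binary vector using character Unicode values.
    Each character deterministically activates one bit inside its own slot."""
    vec = np.zeros(VEC_BITS, dtype=np.uint8)
    for i, ch in enumerate(word[:MAX_CHARS]):
        slot = i * CHAR_BITS
        vec[slot + (ord(ch) % CHAR_BITS)] = 1
    return vec

def overlap_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of active bits shared by the two encodings."""
    active = max(a.sum(), b.sum())
    return float(np.logical_and(a, b).sum()) / active if active else 0.0

# Decide whether a real next word matches the model's prediction:
THRESHOLD = 0.6       # assumed decision threshold, not the paper's value
predicted, actual = encode_word("precipitation"), encode_word("precipitation")
print(overlap_similarity(predicted, actual) >= THRESHOLD)   # True
```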
(This article belongs to the Topic Geocomputation and Artificial Intelligence for Mapping)
Figure 1: General idea of the paper.
Figure 2: HTM structure [57,58]. (A) HTM has a three-level hierarchy. The smallest unit is an HTM cell; in each layer there are a large number of cells, multiple cells form mini-columns, and multiple mini-columns form regions. (B) The end-to-end HTM system includes an encoder, HTM SP, HTM TM, and a classifier. (C) An HTM neuron has one proximal dendrite and several distal dendrites, and the dendrites have different functions: proximal dendrites receive feedforward inputs, while distal dendrites receive contextual information from nearby cells in the layer. (D) All cells in the same mini-column share the same synapses that receive feedforward inputs, which means they receive the same information. (E) Each layer of the HTM model consists of several mini-columns of cells that can read and form synaptic connections with input data.
Figure 3: Example of encoding words in the names of a geoscientific dataset using our method.
Figure 4: Structure of the HTM spatial pooler [58].
Figure 5: Structure of HTM temporal memory [63,64]. Cells in the TM process can exist in three states: inactive, active, or predictive. When a cell does not receive any feedforward input, it is in an inactive state (purple triangle); when it receives feedforward input, it is in an active state (green triangle). Sufficient lateral activity on a contextual dendrite leads to a predictive state (red triangle).
Figure 6: Prediction accuracy with different model sizes and training iteration times.
Figure 7: Comparison of the five methods on precision, recall, and F1-score.
18 pages, 2763 KiB  
Article
Short-Term Power Load Forecasting Using a VMD-Crossformer Model
by Siting Li and Huafeng Cai
Energies 2024, 17(11), 2773; https://doi.org/10.3390/en17112773 - 5 Jun 2024
Viewed by 1078
Abstract
There are several complex and unpredictable aspects that affect the power grid. To make short-term power load forecasting more accurate, a short-term power load forecasting model that utilizes the VMD-Crossformer is suggested in this paper. First, the ideal number of decomposition layers was ascertained using a variational mode decomposition (VMD) parameter optimization approach based on the Pearson correlation coefficient (PCC). Second, the original data were decomposed into multiple modal components using VMD, and then the original data were reconstructed with the modal components. Finally, the reconstructed data were input into the Crossformer network, which exploits the cross-dimensional dependence of multivariate time series (MTS) prediction; that is, the dimension-segment-wise (DSW) embedding and the two-stage attention (TSA) layer were designed to establish a hierarchical encoder–decoder (HED), and the final prediction was performed using information from different scales. The experimental results show that the method could predict the electricity load with high accuracy and reliability. The MAE, MAPE, and RMSE were 61.532 MW, 1.841%, and 84.486 MW, respectively, for dataset I, and 68.906 MW, 0.847%, and 89.209 MW, respectively, for dataset II. Compared with other models, the model in this paper produced better predictions.
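The dimension-segment-wise (DSW) embedding mentioned in the abstract splits each variable of the multivariate series into fixed-length segments and projects every segment to a vector, so the later two-stage attention can operate across both time segments and variable dimensions. The PyTorch sketch below illustrates that reshape-and-project step only; the tensor layout and parameter names are our own, and the series length is assumed to be divisible by the segment length.

```python
import torch
import torch.nn as nn

class DSWEmbedding(nn.Module):
    """Dimension-segment-wise embedding sketch: (B, T, D) -> (B, D, T//L, d_model)."""
    def __init__(self, seg_len: int, d_model: int):
        super().__init__()
        self.seg_len = seg_len
        self.proj = nn.Linear(seg_len, d_model)   # shared linear map per segment

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        assert t % self.seg_len == 0, "series length must be divisible by seg_len"
        # Group each variable's series into segments of length seg_len.
        x = x.permute(0, 2, 1)                    # (B, D, T)
        x = x.reshape(b, d, t // self.seg_len, self.seg_len)
        return self.proj(x)                       # (B, D, n_seg, d_model)

# Toy usage: a 96-step load series with 3 variables and 12-step segments.
emb = DSWEmbedding(seg_len=12, d_model=64)
print(emb(torch.randn(8, 96, 3)).shape)           # torch.Size([8, 3, 8, 64])
```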
(This article belongs to the Section F: Electrical Engineering)
Figure 1: DSW embedding.
Figure 2: Two-stage attention layer.
Figure 3: Architecture of the hierarchical encoder–decoder.
Figure 4: Model prediction process.
Figure 5: Original data: (a) dataset I; (b) dataset II. The datasets were split in a 7:1:2 ratio between training, validation, and testing. Training on the training set lets the model learn the laws and patterns of the time series data; evaluating and tuning on the validation set further enhances the model's predictive capacity, accuracy, and stability; and assessing the predictions on the test set objectively measures the model's generalizability and its ability to forecast electric loads accurately in real-world applications.
Figure 6: VMD decomposition results: (a) dataset I; (b) dataset II.
Figure 7: Comparison of simple model predictions: (a) dataset I; (b) dataset II.
Figure 8: Comparison of complex model predictions: (a) dataset I; (b) dataset II.
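For reference, the MAE, MAPE, and RMSE figures quoted in the abstract follow their standard definitions. The short sketch below computes them; the example load values are purely illustrative and are not the paper's data.

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))          # mean absolute error

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))  # root mean squared error

def mape(y_true, y_pred):
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)  # percent error

# Illustrative load values in MW (hypothetical, not from the paper):
y_true = np.array([3300.0, 3420.0, 3510.0, 3385.0])
y_pred = np.array([3260.0, 3455.0, 3490.0, 3410.0])
print(mae(y_true, y_pred), mape(y_true, y_pred), rmse(y_true, y_pred))
```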