1. Introduction
Diabetic Retinopathy (DR) is a leading cause of blindness and a major threat to the working-age population worldwide [1]. Visual impairment frequently results from postponed follow-ups and referrals, underscoring the importance of prompt screening and diagnosis [2]. Computer-aided systems offer a viable way to expand screening capacity and free up physicians' time [3].
Retinal vessel segmentation plays a pivotal role in evaluating arteriolar-to-venular width ratios [4], analyzing blood flow [5], assessing image quality [6], and supporting applications such as retinal image registration [7] and synthesis [8]. Early methods in this domain were unsupervised, relying on conventional image-processing operations [9,10]. However, their limitations in handling pathological structures and adapting to diverse image appearances led to a shift towards learning-based approaches [11,12].
Analysis of retinal vascular morphology is essential for detecting and monitoring the development of diabetic retinopathy [13]. Research confirms that metrics describing vascular structure can serve as predictive markers of disease progression and severity [14]. Retinal vessels are often the site of important DR signs such as hemorrhages and microaneurysms, which underscores the importance of accurate vessel segmentation [15].
Earlier approaches required ophthalmologists to segment vessels manually, a tedious and time-consuming procedure that was also prone to human error [16]. Artificial Intelligence (AI) has revolutionized medical image processing by introducing advanced techniques that significantly enhance diagnostic accuracy and efficiency [12,16,17]. Because they can automatically learn discriminative features, deep learning (DL) techniques, in particular Convolutional Neural Networks (CNNs), have become increasingly popular for retinal vessel segmentation [12,18]. Within the field of DL-based methods, U-Net [19] and its variants have demonstrated effectiveness in medical image segmentation applications, including retinal vessel segmentation. Improved segmentation accuracy has been attributed to several models, including the joint loss proposed by Yan, Yang, and Cheng [20] and the vessel segmentation network by Gu et al. [21]. Semantic information capture has been further increased by models such as DEU-Net [22] and DeepVessel [23], which use multi-scale convolution blocks.
Variants of encoder-decoder networks have been successful, but preserving feature information between the encoder and decoder remains a challenge [24]. To retain spatial information, feature extraction techniques at the encoder and feature fusion approaches at the decoder have been investigated [12]. Although U-shaped generators and adversarial learning have been studied, more research is needed to integrate these design stages systematically for overall performance enhancement [25].
Oktay et al. [26] introduced Attention U-Net, which demonstrated how attention mechanisms can enhance segmentation accuracy by focusing on relevant features. Moreover, Qin et al. [27] utilized a dual-attention mechanism to improve the segmentation of medical images, particularly in challenging regions. More recently, Dosovitskiy et al. [28] applied Vision Transformers (ViT) to image classification, showing significant improvements in capturing global context compared to CNNs, and Wang et al. [29] extended transformers to dense prediction tasks such as segmentation, highlighting their ability to model long-range dependencies. Chen et al. [30] proposed DeepLab, which incorporates atrous spatial pyramid pooling (ASPP) to capture multi-scale context, while Zhao et al. [31] introduced PSPNet, which uses pyramid pooling to aggregate global context and improve segmentation across varying scales.
In this work, we introduce a novel framework for retinal vessel segmentation tailored to applications in DR diagnosis. The proposed model seamlessly integrates transformer blocks, a Joint Pyramid Upsampling (JPU) module, and a Convolutional Block Attention Module (CBAM) for enhanced feature extraction and representation. It goes beyond conventional approaches by leveraging attention mechanisms and joint pyramid upsampling to capture intricate vessel structures across multiple scales. Furthermore, we contribute a comprehensive evaluation of the model's performance on benchmark datasets, demonstrating its effectiveness in addressing key challenges such as intensity inhomogeneity and variation in vessel characteristics. This work not only introduces a novel segmentation model but also represents a significant stride toward accurate and efficient automated diabetic retinopathy screening. The key contributions of this work are as follows:
- We introduce transformer blocks into the model architecture to facilitate contextualized feature extraction, allowing the model to capture long-range dependencies and improve feature representations.
- The addition of the Joint Pyramid Upsampling (JPU) module enhances multi-scale feature fusion, enabling the model to effectively capture information across different scales and improve segmentation accuracy.
- Channel- and spatial-attention mechanisms (CBAM) are integrated into the model to adaptively recalibrate feature maps and focus on relevant spatial regions, therefore improving segmentation performance.
- Our proposed framework achieves superior accuracy in retinal vessel segmentation, particularly for diabetic retinopathy diagnosis, surpassing existing methods in terms of performance metrics such as Mean IoU, Recall, and F1 Score.
2. Literature Review
Automatic retinal vessel segmentation techniques emerged with the introduction of AI [32]. These techniques were initially unsupervised and later evolved into supervised methods that exploit labeled data. The literature on retinal vessel segmentation underscores its critical role in accurate visualization, diagnosis, early treatment, and surgical planning for ocular diseases. Recent developments in deep learning have pushed segmentation performance to the state of the art, yet difficulties remain owing to the wide variation in vessel morphology against noisy backgrounds, including low discriminative ability in the optic disc area and trouble correctly identifying small, thin vessels. Xiao et al. present an improved U-Net-like model with a skip connection technique and a weighted attention strategy to address these issues [33]. Experiments on two benchmark datasets confirm that the method improves the accuracy of retinal vessel segmentation and alleviates the problems described above.
Pyramid U-Net offers a solution by leveraging Pyramid-Scale Aggregation Blocks (PSABs) to aggregate features across multiple scales, yet challenges in segmenting capillaries persist despite these advancements [34]. ARP-Net represents a significant advance in retinal vessel segmentation by integrating the Adaptive Gated Axial Transformer (AGAT) and residual and point-repair modules, enabling more accurate delineation of retinal vascular structures [35]. PCAT-UNet [36] combines the global modeling capability of transformers with the local feature extraction strengths of CNNs, thereby addressing the limitations of existing methods in capturing both global dependencies and local feature details. A limitation of PCAT-UNet is its susceptibility to overfitting, especially when trained on datasets with limited variability or insufficient sample diversity. This can reduce the model's ability to generalize to unseen data or variations in imaging conditions, limiting its performance in real-world applications where such variability is common.
Recent advancements in deep learning have significantly enhanced blood vessel segmentation techniques. In particular, attention-based models and multi-scale features have allowed neural networks to achieve better performance [37,38,39]. Zhou et al. (2023) introduced the Multi-scale Context Dense Aggregation Network (MCDAU-Net), which tackles issues such as inadequate sample sizes and neglect of contextual information by utilizing concentric patches and incorporating a Cascaded Dilated Spatial Pyramid Pooling (CDSPP) module to enlarge the receptive field; its InceptionConv (IConv) module explores deeper semantic features, and its Multi-scale Adaptive Feature Aggregation (MAFA) module fuses multi-scale features through adaptive weight coefficients, achieving superior segmentation performance [40]. Liu et al. (2022) proposed ResDO-UNet, integrating residual depth-wise over-parameterized convolutional layers (ResDO-conv), a pooling fusion block (PFB) that combines max and average pooling, and an attention fusion block (AFB) for effective multi-scale feature expression, enhancing feature extraction and reducing the information loss caused by repeated pooling operations [41]. The MTPA_Unet architecture segments retinal vessels by leveraging both the local feature extraction capabilities of CNNs and the long-distance dependency modeling strengths of transformers [42]. However, the scalability of MTPA_Unet may be limited when dealing with images of varying resolutions or sizes, as the network architecture may not readily adapt to such changes, potentially leading to decreased segmentation accuracy.
Xie et al. (2024) developed ARSA-UNet, introducing a structure-adaptive layer to enhance convolution performance and an atrous residual path to dynamically adjust the receptive field, combined with a feature-screening fused module in the decoder and a multi-scale deep supervised mechanism for efficient training and effective fusion of layer information [43]. Zhang et al. (2024) proposed SMP-Net with a sequencer-convolution (SC) module to extract both local and global features, a residual multi-kernel pooling (MP) module to maintain spatial information, and a pixel attention (PA) module to improve vascular feature identification [44]. Fang et al. (2024) introduced a model combining deep convolutional networks with biological visual feature extraction mechanisms, using Gabor functions to simulate visual pathways and an adaptive optimization scheme with hybrid loss function weights to enhance small vessel information and overall semantic richness [45].
While existing methods have made significant strides in retinal vessel segmentation, they often face challenges in capturing both local details and global dependencies efficiently. The proposed novel framework addresses these limitations by integrating the strengths of CNNs and transformer architectures, offering a promising avenue for improved retinal vessel segmentation.
3. Dataset
Retinal vessel segmentation, a critical task in diagnosing retinal diseases, demands robust datasets for algorithm development and evaluation. Among the available datasets, DRIVE [46], CHASEDB1 [47], and STARE [48] are notable for their extensive annotations and suitability for retinal vessel segmentation tasks. In our study, we have also incorporated three additional datasets, HRF [49], IOSTAR [11], and LES-AV [5], to ensure a comprehensive evaluation.
The DRIVE dataset consists of 40 high-resolution fundus images, seven of which exhibit abnormal retinal conditions. These images, captured using a fundus camera with a 45-degree field of view, provide diverse characteristics for training and testing segmentation algorithms. Similarly, the CHASEDB1 dataset comprises 28 fundus images with annotations for vessel segmentation tasks. These images, captured using various imaging devices, offer valuable insights into retinal pathology and vessel morphology.
The STARE (Structured Analysis of the Retina) dataset comprises 20 eye fundus images with a resolution of 700 × 605 pixels. It includes a comprehensive collection of data, such as diagnosis codes for each image, expert annotations of visible manifestations, and detailed feature mappings. The dataset contains artery/vein labeling by two experts. The expert labeling makes the STARE dataset a valuable tool for retinal image analysis and machine learning applications.
The HRF (High-Resolution Fundus) dataset is tailored for retinal vessel segmentation, consisting of 45 images divided into 15 subsets. Each subset includes three types of images: one from a healthy eye, one from an individual with diabetic retinopathy, and one from a glaucoma patient. The images are high-resolution, measuring 3304 × 2336 pixels. The dataset is organized with 22 images for training and 23 for testing, providing a diverse set of images to assess segmentation models across various retinal conditions.
The LES-AV dataset consists of 22 fundus images, each annotated manually for blood vessels, with separate labels for arteries and veins. Meanwhile, the IOSTAR vessel segmentation dataset comprises 30 high-resolution images, each with a resolution of 1024 × 1024 pixels. The dataset features expert annotations for retinal vessels, the optic disc, and the artery/vein ratio, making it a comprehensive resource for studying and evaluating vessel segmentation techniques in retinal image analysis.
For our experiments, we meticulously divided the images into training and testing splits. Specifically, we allocated 20 images from the DRIVE dataset for training and reserved the remaining 20 for testing. In the case of the CHASEDB1 dataset, we allocated 14 images for training and used the remaining 14 for testing. To enhance the robustness of our models, we applied data augmentation techniques exclusively during the training phase, as shown in Figure 1. This approach ensures that our models learn from augmented variations of the training data, therefore improving their generalization capabilities. After data augmentation, 160 training images and 40 validation images were used for the DRIVE dataset, and 112 training images and 28 validation images for the CHASEDB1 dataset. By adhering to these carefully designed training and testing splits, we aim to conduct comprehensive evaluations while maintaining consistency and rigor in our experimentation process.
4. Methodology
In this section, we present the methodology employed in our study, encompassing various components essential for the accurate segmentation of retinal vessels. We begin by detailing the data augmentation techniques utilized to enrich the training dataset and enhance model generalization. Subsequently, we delve into the architecture of our proposed model, comprising three key modules: Channel-Attention and Spatial-Attention (CBAM), Joint Pyramid Upsampling (JPU), and Transformer. Each module contributes uniquely to the overall framework, facilitating robust feature representation, context aggregation, and segmentation accuracy. Through a comprehensive exploration of these components, we aim to provide a thorough understanding of our methodology and its implications for retinal vessel segmentation tasks.
Data augmentation is a crucial technique employed to enhance the diversity and size of the training dataset, therefore improving the performance and robustness of deep-learning models for retinal vessel segmentation. In this work, we have developed a data augmentation pipeline implemented using Python and leveraging the OpenCV and imgaug libraries. The pipeline consists of several augmentation techniques applied to the input fundus images and their corresponding vessel segmentation masks.
First, we perform gamma correction to adjust the brightness and contrast of the images. With pixel intensities normalized to [0, 1], gamma correction is defined as
$$ I_{\text{out}} = I_{\text{in}}^{\gamma}, $$
where $I_{\text{in}}$ represents the original pixel intensity values, $I_{\text{out}}$ the corrected intensities, and $\gamma$ is the correction factor.
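As a minimal illustration, the lookup-table form of this correction can be sketched in Python with OpenCV and NumPy; the file name and gamma value below are hypothetical and not the exact settings of our pipeline:

```python
import cv2
import numpy as np

def gamma_correct(image: np.ndarray, gamma: float) -> np.ndarray:
    """Apply I_out = I_in ** gamma to an 8-bit image via a lookup table."""
    # Normalize to [0, 1], apply the power law, and rescale to [0, 255].
    table = (((np.arange(256) / 255.0) ** gamma) * 255.0).astype(np.uint8)
    return cv2.LUT(image, table)

# Example usage: gamma < 1 brightens, gamma > 1 darkens the image.
fundus = cv2.imread("fundus_example.png")   # hypothetical file name
adjusted = gamma_correct(fundus, gamma=0.8)
```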
Next, we apply Contrast Limited Adaptive Histogram Equalization (CLAHE) to improve local contrast and enhance vessel visibility. CLAHE transforms the intensity distribution of the image to better utilize the dynamic range, therefore enhancing vessel features.
Furthermore, various geometric transformations are applied to simulate different imaging conditions and orientations. These include rotation, flipping, elastic deformation, motion blur, grid distortion, and optical distortion. Mathematically, the rigid and affine components of these transformations can be represented as
$$ \mathbf{x}' = A\mathbf{x} + \mathbf{t}, $$
where $\mathbf{x}$ represents the coordinates of a pixel in the original image, $\mathbf{x}'$ represents the coordinates after transformation, $A$ is the transformation matrix, and $\mathbf{t}$ is the translation vector.
Additionally, we incorporate median blur to reduce noise and random adjustments of brightness and contrast to simulate variations in imaging conditions.
Overall, this data augmentation pipeline results in a more diverse and representative training dataset, enabling our deep-learning models to generalize better to unseen data and improve the accuracy of retinal vessel segmentation.
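A condensed sketch of such a pipeline, written with the imgaug library mentioned above, is shown below; the specific operators, probabilities, and parameter ranges are illustrative assumptions rather than the exact configuration used in our experiments:

```python
import numpy as np
import imgaug.augmenters as iaa
from imgaug.augmentables.segmaps import SegmentationMapsOnImage

# Hypothetical parameter choices; the actual pipeline settings may differ.
seq = iaa.Sequential([
    iaa.GammaContrast((0.8, 1.2)),                  # brightness/contrast via gamma
    iaa.CLAHE(clip_limit=(1, 4)),                   # local contrast enhancement
    iaa.Fliplr(0.5),                                # horizontal flip
    iaa.Affine(rotate=(-30, 30), translate_percent=(-0.05, 0.05)),
    iaa.Sometimes(0.3, iaa.ElasticTransformation(alpha=30, sigma=5)),
    iaa.Sometimes(0.2, iaa.MotionBlur(k=5)),
    iaa.Sometimes(0.2, iaa.MedianBlur(k=3)),        # noise reduction
])

def augment_pair(image: np.ndarray, mask: np.ndarray):
    """Apply the same geometric transforms to a fundus image and its vessel mask."""
    segmap = SegmentationMapsOnImage(mask.astype(np.int32), shape=image.shape)
    image_aug, segmap_aug = seq(image=image, segmentation_maps=segmap)
    return image_aug, segmap_aug.get_arr()
```

Intensity operators (gamma, CLAHE, blur) are applied only to the image, while geometric operators are mirrored onto the segmentation mask, keeping image and label spatially aligned.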
4.1. Proposed Model
The retinal vessel segmentation model comprises three main modules: Channel-Attention and Spatial-Attention (CBAM), Joint Pyramid Upsampling (JPU), and Transformer. Each module plays a crucial role in enhancing feature representation, contextual information integration, and segmentation accuracy.
The integration of these modules within the retinal vessel segmentation model enables robust feature extraction, context aggregation, and segmentation accuracy enhancement.
Figure 2 illustrates the overall framework along with the architectures of individual modules. These modules are essential for effectively capturing vessel structures, enhancing feature representations, and accurately segmenting retinal vessels, thus facilitating clinical diagnosis and treatment monitoring in ophthalmology.
4.1.1. CBAM Module
The CBAM module incorporates both channel-wise and spatial-wise attention mechanisms to adaptively recalibrate feature maps. The channel-attention mechanism captures inter-channel dependencies, while the spatial-attention mechanism highlights informative spatial regions within each feature map. Mathematically, the channel-attention mechanism is defined as
$$ M_c(x) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(x)) + \mathrm{MLP}(\mathrm{MaxPool}(x))\big), $$
where $x$ represents the input feature map, MLP denotes a multi-layer perceptron, and $\sigma$ denotes the sigmoid activation function. The CBAM module enhances feature discriminability and suppresses irrelevant information, therefore improving segmentation accuracy. As learned from the literature [50], we intentionally refrain from applying dimensionality reduction to preserve a direct relationship between the channels and their corresponding weights.
For better segmentation accuracy, the CBAM module dynamically recalibrates feature maps to focus on pertinent information using its combined channel- and spatial-attention mechanisms. The module captures inter-channel dependencies and highlights informative spatial regions in each feature map. The channel-attention mechanism uses a combination of average pooling and max pooling operations followed by multi-layer perceptrons to capture both global and local channel-wise dependencies, while the spatial-attention mechanism uses convolutional operations to highlight spatially informative regions. Attention maps are created using the sigmoid activation function to adjust feature responses adaptively, improving feature discriminability and suppressing irrelevant information.
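For concreteness, a minimal PyTorch-style sketch of such a CBAM block is given below; the spatial kernel size and the MLP layout (here without dimensionality reduction, in line with the choice discussed above) are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, as described above."""
    def __init__(self, channels: int, spatial_kernel: int = 7):
        super().__init__()
        # Shared MLP without dimensionality reduction (channels -> channels).
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention: sigma(MLP(AvgPool(x)) + MLP(MaxPool(x))).
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: sigma(conv([mean_c(x); max_c(x)])).
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```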
4.1.2. Transformer Module
The Transformer module facilitates self-attention and context modeling within the network, enabling long-range dependency capture and feature refinement. It consists of multi-head attention mechanisms and positional encodings to capture global context information. The multi-head attention at its core is represented as
$$ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}\big(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V}\big), $$
where $Q$, $K$, and $V$ represent the query, key, and value matrices, respectively; $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ denote the learnable projection weights for the $i$-th head of the multi-head attention mechanism; and $W^{O}$ represents the output weight matrix. The Transformer module enhances feature representations by capturing long-range dependencies and contextual relationships, leading to improved vessel segmentation accuracy.
The Transformer module thereby supports feature refinement in segmentation tasks. Using positional encodings and multi-head attention, it gathers long-range dependencies and global context across the network. Each attention head attends to different areas of the input feature maps, enabling parallel processing and effective acquisition of relevant information at various scales. By computing attention weights based on the similarity between query and key vectors, the attention mechanism lets the network suppress unimportant features and noise and concentrate on informative regions. Residual connections and layer normalization improve information flow throughout the network and stabilize training.
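The following PyTorch sketch illustrates one way such a transformer block can operate on flattened feature maps; the embedding dimension, number of heads, and the omission of explicit positional encodings are simplifying assumptions for illustration:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Multi-head self-attention over flattened feature maps (illustrative sketch)."""
    def __init__(self, dim: int, heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> sequence of H*W tokens of dimension C.
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
        q = self.norm1(tokens)
        attn_out, _ = self.attn(q, q, q)                 # self-attention
        tokens = tokens + attn_out                       # residual connection
        tokens = tokens + self.ffn(self.norm2(tokens))   # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```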
4.1.3. JPU Module
The JPU module facilitates multi-scale feature fusion and contextual information aggregation to refine feature representations. It integrates features from different pyramid scales and enhances feature resolution through parallel convolutional operations. The JPU module is defined as
$$ F_{\mathrm{JPU}} = \mathrm{Concat}\big(\mathrm{conv}_{1\times 1}(x),\ \mathrm{conv}_{2\times 2}(x),\ \mathrm{conv}_{3\times 3}(x)\big), $$
where $\mathrm{conv}_{1\times 1}$, $\mathrm{conv}_{2\times 2}$, and $\mathrm{conv}_{3\times 3}$ represent feature maps obtained from 1 × 1, 2 × 2, and 3 × 3 convolutional operations, respectively. The JPU module enriches feature representations with multi-scale contextual information, enabling accurate vessel segmentation across varying vessel widths and contrasts.
The JPU module's concatenation operation is essential for merging feature maps extracted at several pyramid scales, adding multi-scale contextual information to the feature representations. By concatenating the feature maps along the channel dimension, the module integrates features from different receptive fields while preserving spatial information. This allows the model to incorporate contextual information at coarser scales and fine details at higher resolutions, supporting robust vessel segmentation across a wide range of vessel thicknesses and intensities. Additionally, the parallel convolutional operations (1 × 1, 2 × 2, and 3 × 3) further improve feature resolution and variety, enabling the model to learn the discriminative features necessary for precise segmentation of retinal vessels in fundus images. By enhancing feature discrimination and adaptation to intricate vessel architectures, the JPU module improves the overall efficacy of the segmentation framework through the synergistic integration of multi-scale characteristics.
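A simplified PyTorch sketch of this fusion step is shown below; the branch channel counts and the asymmetric padding used to keep the 2 × 2 branch size-preserving are implementation assumptions, and `in_channels` denotes the total channel count after concatenating the pyramid levels:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JPUBlock(nn.Module):
    """Fuse feature maps from several pyramid scales, then apply parallel
    1x1, 2x2, and 3x3 convolutions and concatenate the results."""
    def __init__(self, in_channels: int, branch_channels: int = 64):
        super().__init__()
        self.branch1 = nn.Conv2d(in_channels, branch_channels, kernel_size=1)
        # An even 2x2 kernel needs asymmetric padding to preserve spatial size.
        self.branch2 = nn.Sequential(
            nn.ZeroPad2d((0, 1, 0, 1)),
            nn.Conv2d(in_channels, branch_channels, kernel_size=2))
        self.branch3 = nn.Conv2d(in_channels, branch_channels, kernel_size=3, padding=1)

    def forward(self, feats):
        # Upsample all pyramid levels to the finest resolution and concatenate.
        target = feats[0].shape[-2:]
        fused = torch.cat(
            [F.interpolate(f, size=target, mode="bilinear", align_corners=False)
             for f in feats], dim=1)
        # Parallel multi-scale convolutions over the fused map, concatenated
        # along the channel dimension.
        return torch.cat([self.branch1(fused), self.branch2(fused),
                          self.branch3(fused)], dim=1)
```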
4.2. Integration of CBAM, Transformer, and JPU Modules
In our proposed retinal vessel segmentation framework, the collaboration between the CBAM, Transformer, and JPU modules plays a pivotal role in enhancing feature representation, contextual understanding, and segmentation accuracy. This subsection delves into the interactions between these modules and their collective impact on model performance.
The CBAM first processes the input feature maps by applying both channel and spatial-attention mechanisms. By recalibrating feature maps, CBAM emphasizes important features while suppressing irrelevant ones. The output from CBAM serves as enriched input to the Transformer module, which further processes these features by capturing long-range dependencies and contextual relationships. This synergy allows the Transformer to leverage the discriminative features enhanced by CBAM, resulting in a more robust feature representation that is essential for accurate segmentation.
Mathematically, the integration can be represented as
$$ F_{\mathrm{trans}} = \mathrm{Transformer}\big(\mathrm{CBAM}(x)\big). $$
Here, the output of the CBAM module is fed into the Transformer module, enhancing the contextual understanding of the input data.
Following the Transformer module, the Joint Pyramid Upsampling (JPU) module fuses multi-scale features to refine the segmentation output. The JPU leverages the feature representations processed by the Transformer, allowing it to effectively capture fine details while integrating contextual information across different scales. The fusion process in JPU utilizes the outputs from various convolutional operations to create a comprehensive feature representation:
$$ F_{\mathrm{out}} = \mathrm{Concat}\big(\mathrm{conv}_{1\times 1}(F_{\mathrm{trans}}),\ \mathrm{conv}_{2\times 2}(F_{\mathrm{trans}}),\ \mathrm{conv}_{3\times 3}(F_{\mathrm{trans}}),\ F_{\mathrm{trans}}\big). $$
In this equation, the features output from the Transformer are concatenated with the results of the JPU's convolutional operations, enabling the model to effectively leverage both high-level semantic information and low-level spatial details. This integration is crucial for accurately segmenting vessels of varying sizes and intensities.
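Assuming the CBAM, TransformerBlock, and JPUBlock sketches given earlier, the overall flow can be illustrated as follows; the channel counts and the final sigmoid head for a binary vessel map are assumptions for illustration, not the exact configuration of our network:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Illustrative chaining of CBAM -> Transformer -> JPU over encoder features."""
    def __init__(self, channels: int = 256, num_classes: int = 1):
        super().__init__()
        self.cbam = CBAM(channels)                     # recalibrate encoder features
        self.transformer = TransformerBlock(channels)  # long-range context
        self.jpu = JPUBlock(channels, branch_channels=channels // 4)
        self.classifier = nn.Conv2d(3 * (channels // 4), num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.cbam(x)                  # F_att   = CBAM(x)
        f = self.transformer(f)           # F_trans = Transformer(F_att)
        f = self.jpu([f])                 # multi-scale fusion of F_trans
        return torch.sigmoid(self.classifier(f))  # per-pixel vessel probability
```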
5. Implementation and Evaluation Details
This section details the implementation of the proposed model and the figures of merit used to evaluate it.
5.1. Metrics for Retinal Vessel Segmentation
Retinal vessel segmentation is a critical task in medical image analysis aimed at extracting blood vessels from retinal fundus images. Various evaluation metrics are employed to assess the performance of segmentation algorithms. In this section, we describe the metrics commonly used for evaluating retinal vessel segmentation.
5.1.1. Mean Intersection over Union (Mean IoU)
Mean Intersection over Union (Mean IoU), also known as the Jaccard Index, measures the overlap between the predicted segmentation mask and the ground truth mask. It is defined as the ratio of the intersection area to the union area of the predicted and ground truth masks. The Mean IoU is calculated as follows:
$$ \mathrm{Mean\ IoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{|P_i \cap G_i|}{|P_i \cup G_i|}, $$
where $N$ is the total number of images, $P_i$ is the predicted mask for the $i$-th image, and $G_i$ is the ground truth mask for the $i$-th image.
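A straightforward NumPy sketch of this computation over binary masks might look as follows (the small epsilon guarding against empty unions is an implementation assumption):

```python
import numpy as np

def mean_iou(preds, gts, eps: float = 1e-7) -> float:
    """Mean IoU over paired lists of binary prediction / ground-truth masks."""
    scores = []
    for p, g in zip(preds, gts):
        p, g = p.astype(bool), g.astype(bool)
        intersection = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        scores.append((intersection + eps) / (union + eps))
    return float(np.mean(scores))
```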
5.1.2. Recall
Recall, also known as Sensitivity or True Positive Rate (TPR), measures the ability of the segmentation algorithm to correctly detect all positive instances (vessel pixels) in the ground truth. It is calculated as the ratio of true positive predictions to the total number of positive instances:
$$ \mathrm{Recall} = \frac{TP}{TP + FN}, $$
where $TP$ is the number of true positives and $FN$ is the number of false negatives.
5.1.3. Precision
Precision measures the accuracy of the positive predictions made by the segmentation algorithm. It is calculated as the ratio of true positive predictions to the total number of positive predictions (both true positives and false positives):
$$ \mathrm{Precision} = \frac{TP}{TP + FP}, $$
where $TP$ is the number of true positives and $FP$ is the number of false positives.
5.1.4. F1 Score
The F1 Score is the harmonic mean of precision and recall, providing a balanced measure of the segmentation algorithm's performance. It is calculated as follows:
$$ \mathrm{F1} = 2 \cdot \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. $$
5.1.5. Specificity
Specificity measures the ability of the segmentation algorithm to correctly detect negative instances (background pixels) in the ground truth. It is calculated as the ratio of true negative predictions to the total number of negative instances:
$$ \mathrm{Specificity} = \frac{TN}{TN + FP}, $$
where $TN$ is the number of true negatives and $FP$ is the number of false positives.
5.1.6. Dice Score
The Dice Score, also known as the F1 Score in segmentation tasks, measures the similarity between the predicted segmentation mask and the ground truth mask. It is calculated as twice the ratio of the intersection area to the sum of the areas of the predicted and ground truth masks:
$$ \mathrm{Dice} = \frac{2\,|P \cap G|}{|P| + |G|}, $$
where $P$ is the predicted mask and $G$ is the ground truth mask.
This section provides a comprehensive overview of the evaluation metrics used in retinal vessel segmentation, including their mathematical formulations. These metrics play a crucial role in quantitatively assessing the performance of segmentation algorithms and guiding further improvements in the field.
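As a practical complement to these definitions, the pixel-wise metrics above can be computed from confusion-matrix counts as in the following NumPy sketch; the small epsilon added for numerical stability is an implementation assumption:

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> dict:
    """Pixel-wise recall, precision, F1, specificity, and Dice for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()

    recall = tp / (tp + fn + eps)
    precision = tp / (tp + fp + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    specificity = tn / (tn + fp + eps)
    dice = 2 * tp / (2 * tp + fp + fn + eps)
    return {"recall": recall, "precision": precision, "f1": f1,
            "specificity": specificity, "dice": dice}
```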
6. Results and Discussion
In this section, we report the results of incrementally adding the CBAM, Transformer, and JPU modules to our baseline model, observing improved segmentation accuracy with each addition. Comparisons with existing baseline models consistently show the superior performance of our approach on the CHASEDB1 and DRIVE datasets, highlighting its efficacy in accurately segmenting retinal vessels. We also discuss computational requirements, clinical implications, and strategies for managing complexity, emphasizing the model's potential to enhance diagnostics in ophthalmology.
6.1. Ablation Study
Three essential modules, namely Channel-Attention and Spatial-Attention (CBAM), Transformer, and Joint Pyramid Upsampling (JPU), were added incrementally to improve the baseline model. Each module targets a distinct challenge in retinal vessel segmentation: feature representation, contextual understanding, and multi-scale feature fusion.
The CBAM module was added to the baseline architecture in the initial version of our model. CBAM, known for its capacity to adaptively recalibrate feature maps by capturing inter-channel relationships and emphasizing informative spatial regions, produced an immediate improvement in segmentation performance. This first step showed how effectively attention mechanisms enhance feature representation and capture discriminative features.
Building on the success of CBAM, we further enhanced the model by incorporating the Transformer module. Drawing inspiration from the Transformer architecture in natural language processing, this module enables self-attention and context modeling within the network. The model was thus able to capture long-range dependencies and improve its feature representations, yielding modest additional gains in segmentation accuracy.
In the final phase of our ablation study, we added the JPU module to the architecture alongside CBAM and the Transformer. Designed for multi-scale feature fusion and contextual information aggregation, the JPU module yielded further improvements in segmentation accuracy. By combining features from several pyramid scales and refining representations across scales, it substantially enhanced the model's capacity to capture complex vessel architectures and precisely segment retinal vessels.
Table 1 summarizes the effects of adding these modules to the baseline model. The table shows that each module contributed improvements in the segmentation metrics, with the best results obtained when all three modules were combined. These results highlight the combined benefits of advanced context modeling and feature representation techniques for retinal vessel segmentation tasks.
6.2. Comparison with Existing Baseline Models
To assess the efficacy of our proposed retinal vessel segmentation model, we carried out a thorough comparison with baseline models frequently used in medical image segmentation. These baselines represent established architectures for segmenting retinal vessels. By contrasting our model's performance with theirs, we evaluate the progress made by integrating multi-scale feature fusion, contextual understanding, and attention mechanisms.
Table 2 presents the performance metrics of our proposed model compared to existing baseline models on the CHASEDB1, DRIVE, HRF, LES-AV, IOSTAR, and STARE datasets. The existing baseline models include ResU-Net, Attention U-Net, and DeepLabv3+, which have been extensively studied and benchmarked for segmentation tasks.
Across the six datasets, our approach consistently outperforms the existing baseline segmentation models on all performance metrics. In particular, the combination of attention mechanisms, contextual understanding, and multi-scale feature fusion yields notable gains in Mean IoU, Recall, Precision, and F1 Score.
On the CHASEDB1 dataset, our proposed model achieves a Mean IoU of 0.8047, while ResU-Net and Attention U-Net achieve Mean IoUs of 0.7853 and 0.7872, respectively. Similarly, our model outperforms ResU-Net (0.7871) and Attention U-Net (0.7746) on the DRIVE dataset, with a Mean IoU of 0.8027. These findings demonstrate the effectiveness of our methodology in accurately segmenting retinal vessels and its potential to raise the standard of ophthalmology diagnostics.
For the HRF dataset, our model achieves a Mean IoU of 0.7713, which is higher than the 0.7655 and 0.7662 achieved by ResU-Net and Attention U-Net, respectively. On the LES-AV dataset, our model records a Mean IoU of 0.8194, surpassing the 0.8153 and 0.7996 achieved by ResU-Net and Attention U-Net.
In the IOSTAR dataset, the proposed model achieves a Mean IoU of 0.8069, outperforming the 0.7977 and 0.7987 by ResU-Net and Attention U-Net, respectively. Finally, on the STARE dataset, our model achieves a Mean IoU of 0.8313, higher than 0.8194 by ResU-Net and 0.8249 by Attention U-Net.
6.3. Qualitative Analysis
Additionally, we conducted a visual comparison of the segmentation results obtained from the baseline models and our proposed model. Figure 3 illustrates the segmentation outputs for five different input retinal images from the CHASEDB1 dataset, while Figure 4 illustrates the segmentation outputs for two images each from the IOSTAR, HRF, and LES-AV datasets. The results from Attention U-Net, ResU-Net, DeepLabv3+, and our proposed model are compared side by side.
Visual inspection of the segmentation results reveals significant variations in vessel connectivity and continuity across the baseline models. For example, DeepLabv3+ shows poor vessel-to-vessel connectivity, leading to discontinuous and fragmented vessel segments. This can be attributed to the architecture's intrinsic limitations in capturing the contextual information and long-range dependencies required to keep vessels connected.
Although vessel connectivity is better in the segmentation results of Attention U-Net, ResU-Net, and our proposed model than in DeepLabv3+, some vessel breaks and discontinuities remain. Factors such as vessel overlap, low-contrast areas, and noisy image artifacts can cause these gaps and compromise the precision of segmentation algorithms.
6.4. Comparison with Existing Retina Segmentation Models
In this section, we compare the performance of our proposed model with several state-of-the-art retina segmentation models, specifically focusing on their specificity across three widely used datasets: DRIVE, CHASE_DB1, and STARE. The models included in the comparison are MCDAU-Net (2023) [40], ResDO-UNet (2022) [41], ARSA-UNet (2024) [43], SMP-Net (2024) [44], and Gabor-Net (2024) [45]. Specificity is a crucial metric in segmentation tasks, as it measures the model's ability to correctly identify the negative class, thus reducing the number of false positives.
The bar graph (Figure 5) presents a comparative analysis of the specificity values for each model across the three datasets. Our proposed model achieves the highest specificity on all of them. On the DRIVE dataset, the proposed model achieves a specificity of 0.9826, outperforming SMP-Net (2024), which records a specificity of 0.9813. Similarly, on the CHASE_DB1 dataset, our model records a specificity of 0.9892, significantly higher than the next best model, SMP-Net (2024), which has a specificity of 0.9876. On the STARE dataset, our proposed model again leads with a specificity of 0.981, surpassing ResDO-UNet (2022), which has a specificity of 0.9792.
These results clearly demonstrate the superiority of our proposed model over the existing models, highlighting its robustness and reliability in retinal vessel segmentation tasks; the values are listed in Table 3 for the STARE, DRIVE, and CHASE_DB1 datasets. The high specificity values indicate that our model is highly effective in correctly identifying non-vessel regions, therefore reducing the incidence of false positives. This capability is crucial for enhancing the accuracy and reliability of diagnostic tools in ophthalmology.
The accompanying bar graph visually reinforces these comparisons, providing a clear and concise representation of the superior performance of our proposed model across the three datasets. The consistent outperformance of our model underscores its potential to set a new standard in retina segmentation, improving diagnostic outcomes in clinical settings.
6.5. Discussion
Compared to the other baseline models, our proposed model exhibits improved vessel connectivity and fewer instances of breakage. This improvement stems from the integrated attention mechanisms, contextual understanding, and multi-scale feature fusion, which together preserve vessel continuity and yield more robust vessel segmentation. The superior performance of our proposed model on all the datasets demonstrates its strong generalization capability: consistently high metrics across these distinct datasets indicate the model's robustness and potential applicability to varied retinal imaging conditions and populations.
Transformers are known for their computational intensity due to their self-attention mechanisms, which scale quadratically with the input size. This section discusses the computational requirements and potential limitations of our proposed model. The proposed retinal vessel segmentation model, integrating CBAM, Transformer, and JPU modules, has demonstrated superior performance but at a computational cost. The Transformer module, in particular, involves multi-head self-attention and positional encoding, which increase the model’s complexity. To manage these computational demands, we utilized GPUs for training and inference.
The implementation was carried out on an NVIDIA RTX 3090 GPU, with training times averaging 10 h for 50 epochs on the DRIVE dataset. Memory usage peaked at 18 GB due to the Transformer module’s high-dimensional matrix operations. In terms of scalability, while the model performs well on standard datasets, applying it to higher-resolution images or real-time applications may require optimization strategies. These could include model pruning, quantization, or efficient transformer variants like Linformer or Performer, which aim to reduce the quadratic complexity of self-attention. Future work will focus on optimizing the model for real-time applications and deploying it on edge devices. This will involve reducing the model size and inference time without significantly compromising accuracy.
The improved segmentation performance of our model has significant clinical implications in ophthalmology. Accurate retinal vessel segmentation is crucial for diagnosing and monitoring various eye conditions, including diabetic retinopathy, glaucoma, and hypertensive retinopathy. Improved segmentation facilitates early detection of microaneurysms, hemorrhages, and neovascularization, enabling timely intervention. Precise vessel segmentation allows clinicians to monitor disease progression more accurately, adjusting treatments as necessary. The model's high accuracy reduces the need for manual annotation, speeding up the diagnostic process and allowing clinicians to focus on treatment. Furthermore, enhanced segmentation supports remote diagnosis, which is particularly beneficial in under-resourced regions, improving access to eye care. We plan to collaborate with ophthalmologists to validate the model's performance in clinical settings, assessing its impact on diagnostic accuracy and patient outcomes. This collaboration will help refine the model based on real-world feedback, ensuring its effectiveness in diverse clinical scenarios.
7. Conclusions
In this paper, we reported the development of a novel retinal vessel segmentation model specifically designed for the diagnosis of diabetic retinopathy, built on advanced deep-learning components including Transformer blocks, Joint Pyramid Upsampling, and channel and spatial attention (CBAM). In comprehensive tests on benchmark datasets, our model outperformed recent baseline models in terms of vessel continuity and connectivity while retaining a high degree of segmentation accuracy. The proposed methodology holds great promise for automating diabetic retinopathy screening, providing medical practitioners with accurate and effective tools for early diagnosis and treatment monitoring. By optimizing the screening process, it can reduce the workload on healthcare systems and accelerate patient care.
Looking ahead, further studies will aim to improve the model's overall performance and robustness. This includes exploring alternative data augmentation techniques to enhance generalization across datasets and investigating ways to broaden the model's applicability to retinal disorders beyond diabetic retinopathy. Through iterative improvements and further development of the proposed framework, we hope to advance computer-aided diagnosis systems in ophthalmology, ultimately leading to better patient outcomes and higher global standards of eye care.
Author Contributions
Conceptualization, H.-J.K., H.E., K.T.C.; methodology, H.-J.K., H.E., K.T.C.; software, H.-J.K., H.E., K.T.C.; validation, H.-J.K., H.E., K.T.C.; investigation, H.-J.K., H.E., K.T.C.; writing—original draft preparation: H.-J.K., H.E., K.T.C.; writing, review and editing, H.-J.K., H.E., K.T.C.; supervision, K.T.C. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (2020R1A2C2005612).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The datasets utilized in this paper (DRIVE, CHASE_DB1, STARE, HRF, LES-AV, and IOSTAR) are publicly accessible, whereas the augmented datasets used in the experiments will be made available upon request. For reproducibility, the code files and trained models for each dataset will be made available upon request for research purposes.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Khan, T.M.; Alhussein, M.; Aurangzeb, K.; Arsalan, M.; Naqvi, S.S.; Nawaz, S.J. Residual connection-based encoder decoder network (RCED-Net) for retinal vessel segmentation. IEEE Access 2020, 8, 131257–131272. [Google Scholar] [CrossRef]
- Tungsattayathitthan, U.; Rattanalert, N.; Sittivarakul, W. Long-term visual acuity outcome of pediatric uveitis patients presenting with severe visual impairment. Sci. Rep. 2023, 13, 2919. [Google Scholar] [CrossRef] [PubMed]
- Arenson, R.L.; Andriole, K.P.; Avrin, D.E.; Gould, R.G. Computers in imaging and health care: Now and in the future. J. Digit. Imaging 2000, 13, 145–156. [Google Scholar] [CrossRef]
- Niemeijer, M.; Xu, X.; Dumitrescu, A.V.; Gupta, P.; Van Ginneken, B.; Folk, J.C.; Abramoff, M.D. Automated measurement of the arteriolar-to-venular width ratio in digital color fundus photographs. IEEE Trans. Med. Imaging 2011, 30, 1941–1950. [Google Scholar] [CrossRef]
- Orlando, J.I.; Barbosa Breda, J.; Van Keer, K.; Blaschko, M.B.; Blanco, P.J.; Bulant, C.A. Towards a glaucoma risk index based on simulated hemodynamics from fundus images. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, 16–20 September 2018; Proceedings, Part II 11. Springer: Berlin/Heidelberg, Germany, 2018; pp. 65–73. [Google Scholar]
- Welikala, R.; Fraz, M.; Foster, P.; Whincup, P.; Rudnicka, A.R.; Owen, C.G.; Strachan, D.; Barman, S.A. Automated retinal image quality assessment on the UK Biobank dataset for epidemiological studies. Comput. Biol. Med. 2016, 71, 67–76. [Google Scholar] [CrossRef] [PubMed]
- Chen, L.; Huang, X.; Tian, J. Retinal image registration using topological vascular tree segmentation and bifurcation structures. Biomed. Signal Process. Control 2015, 16, 22–31. [Google Scholar] [CrossRef]
- Costa, P.; Galdran, A.; Meyer, M.I.; Niemeijer, M.; Abràmoff, M.; Mendonça, A.M.; Campilho, A. End-to-end adversarial retinal image synthesis. IEEE Trans. Med. Imaging 2017, 37, 781–791. [Google Scholar] [CrossRef]
- Zana, F.; Klein, J.C. Segmentation of vessel-like patterns using mathematical morphology and curvature evaluation. IEEE Trans. Image Process. 2001, 10, 1010–1019. [Google Scholar] [CrossRef]
- Mendonca, A.M.; Campilho, A. Segmentation of retinal blood vessels by combining the detection of centerlines and morphological reconstruction. IEEE Trans. Med. Imaging 2006, 25, 1200–1213. [Google Scholar] [CrossRef]
- Zhang, J.; Dashtbozorg, B.; Bekkers, E.; Pluim, J.P.; Duits, R.; ter Haar Romeny, B.M. Robust retinal vessel segmentation via locally adaptive derivative frames in orientation scores. IEEE Trans. Med. Imaging 2016, 35, 2631–2644. [Google Scholar] [CrossRef]
- Ryu, J.; Rehman, M.U.; Nizami, I.F.; Chong, K.T. SegR-Net: A deep learning framework with multi-scale feature fusion for robust retinal vessel segmentation. Comput. Biol. Med. 2023, 163, 107132. [Google Scholar] [CrossRef] [PubMed]
- Amin, J.; Sharif, M.; Yasmin, M. A review on recent developments for detection of diabetic retinopathy. Scientifica 2016, 2016, 6838976. [Google Scholar] [CrossRef] [PubMed]
- Bek, T. Diameter changes of retinal vessels in diabetic retinopathy. Curr. Diabetes Rep. 2017, 17, 82. [Google Scholar] [CrossRef] [PubMed]
- Mayya, V.; Kamath, S.; Kulkarni, U. Automated microaneurysms detection for early diagnosis of diabetic retinopathy: A Comprehensive review. Comput. Methods Programs Biomed. Update 2021, 1, 100013. [Google Scholar] [CrossRef]
- Rehman, M.U.; Ryu, J.; Nizami, I.F.; Chong, K.T. RAAGR2-Net: A brain tumor segmentation network using parallel processing of multiple spatial frames. Comput. Biol. Med. 2023, 152, 106426. [Google Scholar] [CrossRef]
- Rehman, M.U.; Akhtar, S.; Zakwan, M.; Mahmood, M.H. Novel architecture with selected feature vector for effective classification of mitotic and non-mitotic cells in breast cancer histology images. Biomed. Signal Process. Control 2022, 71, 103212. [Google Scholar] [CrossRef]
- Liskowski, P.; Krawiec, K. Segmenting retinal blood vessels with deep neural networks. IEEE Trans. Med. Imaging 2016, 35, 2369–2380. [Google Scholar] [CrossRef]
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
- Yan, Z.; Yang, X.; Cheng, K.T. Joint segment-level and pixel-wise losses for deep learning based retinal vessel segmentation. IEEE Trans. Biomed. Eng. 2018, 65, 1912–1923. [Google Scholar] [CrossRef] [PubMed]
- Gu, L.; Zhang, X.; Zhao, H.; Li, H.; Cheng, L. Segment 2D and 3D filaments by learning structured and contextual features. IEEE Trans. Med. Imaging 2016, 36, 596–606. [Google Scholar] [CrossRef]
- Dong, S.; Zhao, J.; Zhang, M.; Shi, Z.; Deng, J.; Shi, Y.; Tian, M.; Zhuo, C. Deu-net: Deformable u-net for 3d cardiac mri video segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, 4–8 October 2020; Proceedings, Part IV 23. Springer: Berlin/Heidelberg, Germany, 2020; pp. 98–107. [Google Scholar]
- Fu, H.; Xu, Y.; Lin, S.; Kee Wong, D.W.; Liu, J. Deepvessel: Retinal vessel segmentation via deep learning and conditional random field. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016: 19th International Conference, Athens, Greece, 17–21 October 2016; Proceedings, Part II 19. Springer: Berlin/Heidelberg, Germany, 2016; pp. 132–139. [Google Scholar]
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
- Li, X.; Li, M.; Yan, P.; Li, G.; Jiang, Y.; Luo, H.; Yin, S. Deep learning attention mechanism in medical image analysis: Basics and beyonds. Int. J. Netw. Dyn. Intell. 2023, 2, 93–116. [Google Scholar] [CrossRef]
- Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
- Qin, Y.; Song, D.; Chen, H.; Cheng, W.; Jiang, G.; Cottrell, G. A dual-stage attention-based recurrent neural network for time series prediction. arXiv 2017, arXiv:1704.02971. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Wang, W.; Chen, W.; Qiu, Q.; Chen, L.; Wu, B.; Lin, B.; He, X.; Liu, W. Crossformer++: A versatile vision transformer hinging on cross-scale attention. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 3123–3136. [Google Scholar] [CrossRef]
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
- Islam, M.M.; Poly, T.N.; Walther, B.A.; Yang, H.C.; Li, Y.C. Artificial intelligence in ophthalmology: A meta-analysis of deep learning models for retinal vessels segmentation. J. Clin. Med. 2020, 9, 1018. [Google Scholar] [CrossRef]
- Xiao, X.; Lian, S.; Luo, Z.; Li, S. Weighted res-unet for high-quality retina vessel segmentation. In Proceedings of the 2018 9th International Conference on Information Technology in Medicine and Education (ITME), Hangzhou, China, 19–21 October 2018; pp. 327–331. [Google Scholar]
- Zhang, J.; Zhang, Y.; Xu, X. Pyramid u-net for retinal vessel segmentation. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–12 June 2021; pp. 1125–1129. [Google Scholar]
- Liu, X.; Zhang, D.; Yao, J.; Tang, J. Transformer and convolutional based dual branch network for retinal vessel segmentation in OCTA images. Biomed. Signal Process. Control 2023, 83, 104604. [Google Scholar] [CrossRef]
- Chen, D.; Yang, W.; Wang, L.; Tan, S.; Lin, J.; Bu, W. PCAT-UNet: UNet-like network fused convolution and transformer for retinal vessel segmentation. PLoS ONE 2022, 17, e0262689. [Google Scholar] [CrossRef]
- Sun, Y.; Dong, M.; Yu, M.; Zhu, L. MBHFuse: A multi-branch heterogeneous global and local infrared and visible image fusion with differential convolutional amplification features. Opt. Laser Technol. 2025, 181, 111666. [Google Scholar] [CrossRef]
- Zhang, F.; Lin, S.; Xiao, X.; Wang, Y.; Zhao, Y. Global attention network with multiscale feature fusion for infrared small target detection. Opt. Laser Technol. 2024, 168, 110012. [Google Scholar] [CrossRef]
- Li, W.; Lambert-Garcia, R.; Getley, A.C.; Kim, K.; Bhagavath, S.; Majkut, M.; Rack, A.; Lee, P.D.; Leung, C.L.A. AM-SegNet for additive manufacturing in situ X-ray image segmentation and feature quantification. Virtual Phys. Prototyp. 2024, 19, e2325572. [Google Scholar] [CrossRef]
- Zhou, W.; Bai, W.; Ji, J.; Yi, Y.; Zhang, N.; Cui, W. Dual-path multi-scale context dense aggregation network for retinal vessel segmentation. Comput. Biol. Med. 2023, 164, 107269. [Google Scholar] [CrossRef] [PubMed]
- Liu, Y.; Shen, J.; Yang, L.; Bian, G.; Yu, H. ResDO-UNet: A deep residual network for accurate retinal vessel segmentation from fundus images. Biomed. Signal Process. Control 2023, 79, 104087. [Google Scholar] [CrossRef]
- Jiang, Y.; Liang, J.; Cheng, T.; Lin, X.; Zhang, Y.; Dong, J. MTPA_Unet: Multi-scale transformer-position attention retinal vessel segmentation network joint transformer and CNN. Sensors 2022, 22, 4592. [Google Scholar] [CrossRef]
- Xie, Y.; Shang, J.; Yang, Q.; Qian, X.; Zhang, H.; Tang, X. ARSA-UNet: Atrous residual network based on Structure-Adaptive model for retinal vessel segmentation. Biomed. Signal Process. Control 2024, 96, 106595. [Google Scholar] [CrossRef]
- Zhang, Y.; Lan, Q.; Sun, Y.; Ma, C. A combination of multi-scale and attention based on the U-shaped network for retinal vessel segmentation. Int. J. Imaging Syst. Technol. 2024, 34, e23045. [Google Scholar] [CrossRef]
- Fang, T.; Cai, Z.; Fan, Y. Gabor-net with multi-scale hierarchical fusion of features for fundus retinal blood vessel segmentation. Biocybern. Biomed. Eng. 2024, 44, 402–413. [Google Scholar] [CrossRef]
- Staal, J.; Abràmoff, M.D.; Niemeijer, M.; Viergever, M.A.; Van Ginneken, B. Ridge-based vessel segmentation in color images of the retina. IEEE Trans. Med. Imaging 2004, 23, 501–509. [Google Scholar] [CrossRef]
- Fraz, M.M.; Remagnino, P.; Hoppe, A.; Uyyanonvara, B.; Rudnicka, A.R.; Owen, C.G.; Barman, S.A. An ensemble classification-based approach applied to retinal blood vessel segmentation. IEEE Trans. Biomed. Eng. 2012, 59, 2538–2548. [Google Scholar] [CrossRef]
- Hoover, A.; Kouznetsova, V.; Goldbaum, M. Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response. IEEE Trans. Med. Imaging 2000, 19, 203–210. [Google Scholar] [CrossRef] [PubMed]
- Budai, A.; Bock, R.; Maier, A.; Hornegger, J.; Michelson, G. Robust vessel segmentation in fundus images. Int. J. Biomed. Imaging 2013, 2013, 154860. [Google Scholar] [CrossRef] [PubMed]
- Rehman, M.U.; Eesaar, H.; Abbas, Z.; Seneviratne, L.; Hussain, I.; Chong, K.T. Advanced drone-based weed detection using feature-enriched deep learning approach. Knowl.-Based Syst. 2024, 305, 112655. [Google Scholar] [CrossRef]
- Lin, J.; Huang, X.; Zhou, H.; Wang, Y.; Zhang, Q. Stimulus-guided adaptive transformer network for retinal blood vessel segmentation in fundus images. Med. Image Anal. 2023, 89, 102929. [Google Scholar] [CrossRef] [PubMed]
- Zhang, H.; Ni, W.; Luo, Y.; Feng, Y.; Song, R.; Wang, X. TUnet-LBF: Retinal fundus image fine segmentation model based on transformer Unet network and LBF. Comput. Biol. Med. 2023, 159, 106937. [Google Scholar] [CrossRef]
Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).