Abstract
Convolutional neural networks are effective at capturing local features and spatial details, but they struggle to obtain global information, which can compromise the segmentation of important image regions. A Transformer can increase the expressiveness of pixels by establishing global relationships between them; however, some transformer-based self-attention methods do not exploit the advantages of convolution, so the resulting models require more parameters and computation. To address these two issues, this work combines Transformer and CNN structures to strengthen the relationship between image-level regions and global information, improving segmentation accuracy and efficiency. We first build a Feature Alignment Module (FAM) to enhance spatial details and improve channel representations. Second, we compute the relationships between similar pixels with a Transformer structure, which enhances the pixel representation. Finally, we design a Pyramid Convolutional Pooling Module (PCPM) that compresses and enriches the feature maps and determines the global correlations among pixels, reducing the computational burden on the Transformer. Together, these three components form a Transformer-based semantic segmentation feature fusion network (FFTNet). Our method achieves 82.5% mIoU on the Cityscapes test dataset. Furthermore, visualization experiments on the Pascal VOC 2012 and Cityscapes datasets show that our approach outperforms alternative methods.
Introduction
Deep learning approaches to computer vision have gained popularity among researchers due to rapid advancements in object recognition1, image restoration2, semantic analysis3 and image processing4. Semantic segmentation is a key computer vision task: every pixel in an image must be labeled5,6,7 and assigned a semantic category, and recovering the semantic and spatial information linked to each category (such as an object's location and class) requires thorough contextual semantic information. Semantic segmentation is widely used in fields such as augmented reality8, medical image analysis9, human–computer interaction10, and autonomous driving11. Convolutional neural networks (CNNs), particularly Fully Convolutional Networks (FCNs)3, can now be trained end-to-end, greatly improving the effectiveness of semantic segmentation. Since FCNs became the standard for dense prediction, they have sparked a great deal of follow-up research, and many researchers have turned their attention to multi-scale contextual features, as evidenced by SegNet12, DeconvNet13, and the DeepLab series14,15,16. DeepLabv314 and DeepLabv3+15 extract multiscale contextual semantic information through atrous convolution with different dilation rates, but the excessive use of atrous convolution introduces gridding artifacts. Similarly, PSPNet17 extracts multiscale contextual semantic information with pyramid pooling blocks consisting of adaptive average pooling at different scales; however, this approach ignores the relationship between a pixel and its set of neighboring pixels. Because the pixels extracted by PSPNet lack rich contextual information, SA-FFNet18 introduces VH-CAM and UC-PPM to enhance the contextual information of pixels, but its pyramid pooling inevitably loses some valuable information.
PSANet19 predicts the attention feature map, which aggregates contextual semantic information at each location. ANNet20 utilizes asymmetric pixel-to-pixel long-range dependencies to enhance the accuracy of semantic segmentation by considering the correlation between pixels and their adjacent neighbors. In contrast, DANet21 calculates pixel-to-pixel long-range dependencies to enhance pixel representations based on feature map channels and spatial locations. OCRNet22 aggregates all features using a similarity map to approximate the context of each pixel object by learning all pixel-to-pixel similarities through a self-attention method. The disadvantage of these approaches is that they only consider the sparse semantic information contained in a single pixel.
Spatial information is more precise at the low-level stages of the network, but semantic consistency is lacking; conversely, the feature maps of the high-level stages have relatively coarse spatial information but strong, consistent semantics. To address this, DFNet23 adds a global average pooling layer and employs a V-shaped structure, instead of a U-shaped one, to capture multi-scale contextual semantic information. FSN24 proposes the SFS and FFF modules to extract important features and integrate them in a flexible way.
Traditional convolutional neural networks extract feature maps whose resolution gradually decreases, and their small receptive field overlooks long-range pixel relationships and lacks global information. Attempts have therefore been made to apply the Transformer25, which has shown remarkable success in Natural Language Processing (NLP), to image tasks. For image classification, Dosovitskiy et al. introduced the Vision Transformer (ViT)26. Following the NLP Transformer architecture, the authors achieved outstanding performance on ImageNet by splitting the image into numerous linearly embedded patches and feeding them, with positional embeddings (PE), to a standard Transformer. ViT performs well, but it has drawbacks such as high computational cost and the inability to generate multi-scale features on large images. Zheng et al. proposed SETR27 to demonstrate that the Transformer can be effectively utilized for semantic segmentation, and the integration of the Transformer into computer vision has significantly improved segmentation. The Transformer can dynamically compute the relationship weights between global pixels and adapt to various input images, making it a very effective tool for capturing global information, which is particularly useful for extracting complex semantic information. However, the Transformer complicates the computation between features and requires a sufficient amount of data to function well. Furthermore, the Transformer model is insensitive to the fine details of an image, especially for multi-scale objects such as small, distant targets. To address the redundancy in network learning and the need for global dependencies, UniFormer28 utilizes convolution and transformer mechanisms to extract both global and local information.
Considering the above analysis, combining the advantages of the Transformer and the CNN can effectively enhance the segmentation results while keeping the model appropriately lightweight. Therefore, in this paper we propose a Transformer-based semantic segmentation feature fusion network. Our model incorporates multi-scale interaction and self-attention mechanisms, enabling it to adaptively distinguish feature map information, remove redundant information, and establish long-distance links across the feature maps of each channel. The Feature Alignment Module (FAM) enhances spatial details and channel representation, and the Transformer structure captures global information, establishes relationships between pixels, and improves pixel representation. Additionally, the Pyramid Convolutional Pooling Module (PCPM) compresses and enriches feature maps using multiple pooling layers to decrease the computational burden on the Transformer. The aggregated context enables pixel relationships to be determined from a global perspective.
In summary, the contribution of this paper can be outlined as follows:
- We propose a fast and efficient semantic segmentation feature fusion network based on the transformer (FFTNet).
- By combining spatial details at the low level with semantic information at the high level, our proposed Feature Alignment Module (FAM) effectively establishes long-distance dependencies. The transformer structure efficiently extracts spatial local and global information, while the Pyramid Convolutional Pooling Module (PCPM) aggregates contexts to enhance segmentation speed and accuracy.
- Our model achieves superior performance on three widely used datasets: ADE20k, Pascal VOC 2012, and Cityscapes. To develop our model, we utilize a single Tesla V100.
Related work
This section provides a brief overview of related work on transformer, convolution, multi-scale feature fusion, encoder-decoder structure, and so on.
Encoder-decoder structure
Typically, the encoder enlarges the receptive field by decreasing the spatial size of the feature map, and the decoder then takes the encoded feature maps and restores them to the resolution required for the predicted maps. SegNet, for instance, employs an encoder-decoder structure that preserves the max-pooling indices; this increases segmentation accuracy and speed while saving memory. Our approach also takes advantage of the encoder-decoder structure.
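As an illustration, the following is a minimal encoder-decoder skeleton for dense prediction in PyTorch; the layer sizes and class names are our own placeholders and do not reflect the configuration used by FFTNet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoderDecoder(nn.Module):
    """Minimal encoder-decoder for dense prediction (illustrative sketch only)."""
    def __init__(self, num_classes=19):
        super().__init__()
        # Encoder: strided convolutions shrink the feature map and enlarge the receptive field.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True),
        )
        # Decoder: project to per-class logits, then restore the input resolution.
        self.classifier = nn.Conv2d(128, num_classes, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = self.encoder(x)              # 1/4-resolution features
        logits = self.classifier(feats)      # per-pixel class scores
        return F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)

# A 512 x 512 RGB image yields a 19-channel per-pixel score map of the same size.
out = TinyEncoderDecoder()(torch.randn(1, 3, 512, 512))
print(out.shape)  # torch.Size([1, 19, 512, 512])
```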
Multiscale feature contextual information
FCN converts a classification network into a fully convolutional network to integrate semantic and representational information; however, this disregards the high-resolution feature maps, lowering the quality of edge information and causing it to be lost. Designing multiscale networks is a common direction in computer vision and is important for tasks such as segmentation; multiscale modules are typically found in the encoders and decoders of segmentation models. MCRNet29 aims to use depth context to direct multilevel fusion; however, combining several feature maps from several stages makes the model slow and redundant. Erroneous segmentation may occur when the object is too large and the receptive field is too small, as the network cannot see the whole object. CENet30 proposes a context-integrated network for semantic segmentation; its drawbacks are high hardware and software requirements and susceptibility to overfitting. Furthermore, GoogLeNet31 achieves multi-scale feature extraction with a multi-branch design, and HRNet32 keeps high-resolution features at deeper stages and combines them with low-resolution features to produce multi-scale features.
Conversely, when the object is too small but the receptive field is too large, the model picks up extraneous information, leading to pixel misclassification because the network struggles to interpret small objects. To increase segmentation accuracy, Zhang et al.33 suggested a method based on pyramid consistency learning; its shortcomings are high processing-resource requirements and insufficient feature processing, which may lose some detailed information. RefineNet34 and GCNet35 integrate feature maps from different stages of a multi-scale context, but they lack a unified global context. DeepLabv3 builds an extended ASPP with cascaded or parallel atrous convolutions and produces satisfactory results without Dense CRF post-processing. PSPNet integrates information from several scales in the PSP module to extract multi-scale contextual semantic knowledge. OCNet employs a self-attention strategy to learn pixel-by-pixel similarity; a similarity attention map is then used to summarize all features and approximate the object's context. OCRNet improves pixel-region representations by determining pixel-region associations. DANet captures contextual information with a spatial attention module and a channel attention module to enhance the pixel representation and, consequently, the segmentation performance.
The aforementioned strategies concentrate on designing sophisticated attention modules, which inevitably require more computation and still cannot successfully build long-distance dependencies. We instead consider strengthening long-distance pixel dependencies, minimizing the cost of matrix computation, and effectively fusing global and local information. In this paper, we use the transformer structure and PCPM to construct long-distance inter-pixel interactions and integrate global and local information.
Convolution and transformer
Because CNNs have local receptive fields and shared weights, they can efficiently extract spatial features with low computational requirements, and their translational invariance further improves their capacity for generalization. Unfortunately, the CNN's limited receptive field prevents it from capturing global information and the interdependence between distant pixels. Conversely, the Transformer can extend the receptive area and extract global information very well, but it is hard to train and insensitive to fine details. Subsequent researchers combined the benefits of CNNs and Transformers to tackle these issues.
UniFormer28 uses the CNN and the transformer to extract global and local information, successfully resolving the redundancy and long-distance interdependence between pixels in network learning. Conformer36 and RFTNet37 improve the feature-representation capacity of network learning by fusing the transformer's ability to create global associations with the CNN's ability to extract local features. Mobile-Former38 extracts local features at the pixel level by using efficient depthwise and pointwise convolution; combining convolution and transformer improves global interaction and significantly reduces computation. AMACF39 combines the self-attention mechanism with a convolutional network to extract the global and local information of feature maps, respectively. In the field of medical image segmentation, TransUNet40 and TransBTS41 applied the Transformer and UNet42 and produced very satisfactory segmentation results; however, they need a lot of training data and computational power to train and optimize, which prevents their use in environments with limited resources or data. The use of numerous attention heads in TranSiam43 is a drawback, as it increases the complexity of the model and necessitates more time and computing power. CoAtNet44 suggested that combining convolution and attention can handle images of different sizes and increase segmentation accuracy; its use of several convolutional kernels of varying sizes results in more model parameters and an increased risk of overfitting.
Motivated by the aforementioned techniques, this work introduces FFTNet, comprising FAM, Transformer, and PCPM. Convolution is employed in the FAM module to combine information and enhance spatial details. In the transformer structure, self-attention enhances the pixel representation, constructs global contextual semantic information for the pixels, and finally fuses the local features with the global information. Convolution and adaptive average pooling in PCPM allow additional local features to be extracted and the number of channels to be reduced, resulting in dense segmentation with rich contextual semantic information for every pixel. In this way, we fully exploit the benefits of the Transformer and convolution: convolution extracts local features from the feature map and compresses the number of channels to obtain spatial detail, while the transformer establishes long-distance interdependencies between pixels and captures global information.
Development of semantic segmentation based on SAM
The Segment Anything45 project introduces a new task, model, and dataset for image segmentation. It uses efficient models in a data-collection loop to build the largest segmentation dataset to date, containing over 1 billion masks and 11 million images. The Segment Anything Model (SAM) has a powerful zero-shot capability to segment new image distributions and tasks with no additional training, which allows it to match or even outperform previous fully supervised results across multiple tasks. Segment Anything 246 is a further extension of the SAM project that addresses the difficulties image segmentation models face with significant changes in appearance due to motion, deformation, occlusion, lighting changes, and so on, especially in video processing. While maintaining the advantages of SAM, SAM 2 further improves the accuracy and efficiency of segmentation, allowing the model to adapt better to various application tasks.
However, these models also have disadvantages. Training a high-performance SAM requires a large amount of data, which may limit its use in certain scenarios, especially in data-scarce domains or tasks. Training also demands substantial computational resources and time, so the training process may be lengthy, which can hinder rapid iteration and deployment of the model.
Method
This section describes our network in detail. We first provide an overview of the overall structure, which uses an encoder-decoder architecture, and then explain the impact of the FAM, Transformer, and PCPM modules on semantic segmentation performance in the following subsections.
FFTNet overall network structure diagram
Here we briefly describe the semantic segmentation feature fusion network based on the transformer (FFTNet). The backbone network used to extract rich feature maps is ResNet10147. Given an image \(F\in {R}^{3\times H\times W}\), where H, W, and 3 stand for the height, width, and number of channels respectively, we utilize ResNet101 to obtain the multi-scale feature maps \({F}_{i}\in {R}^{{2}^{i-1}C\times \frac{H}{{2}^{i+1}}\times \frac{W}{{2}^{i+1}}}\), where \(i\in \{3, 4\}\) denotes the index of the third or fourth stage of ResNet101; in this study we use only these two stages. As illustrated in Fig. 1, we create a Feature Alignment Module (FAM) to fuse the feature maps of the third and fourth stages at different scales. We merge the high-level stage feature maps, which carry strong semantic information, with the low-level stage feature maps, which carry strong spatial information, to create an augmented feature map \(B\in {R}^{D\times \frac{H}{{2}^{3}}\times \frac{W}{{2}^{3}}}\), where D is the number of channels after fusion. Subsequently, the PCPM module efficiently extracts feature map information by adaptively differentiating the significance of every channel in the feature map, and the transformer module improves pixel representation, extracts the spatial detail features from the feature map, and determines the global information relationships. Lastly, to create a prediction map with the same resolution as the original image, the two feature maps are matrix multiplied and up-sampled using multi-scale feature map interpolation.
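To make the data flow concrete, the following PyTorch sketch wires the pieces together in one plausible way; the module interfaces, channel bookkeeping, and the final fusion step (matrix multiplication in Fig. 1) are our assumptions rather than the authors' implementation, and possible versions of FAM, the Transformer block, and PCPM are sketched in the following sections.

```python
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class FFTNetSketch(nn.Module):
    """Schematic FFTNet data flow as described above; not the authors' implementation.
    `fam`, `transformer` and `pcpm` are stand-ins for the modules sketched in the
    following sections, and the channel bookkeeping between them is assumed."""
    def __init__(self, fam, transformer, pcpm, head_channels=256, num_classes=19):
        super().__init__()
        resnet = torchvision.models.resnet101(weights=None)  # load ImageNet weights in practice
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
                                  resnet.layer1, resnet.layer2)
        self.stage3, self.stage4 = resnet.layer3, resnet.layer4
        self.fam, self.transformer, self.pcpm = fam, transformer, pcpm
        self.classifier = nn.Conv2d(head_channels, num_classes, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        f3 = self.stage3(self.stem(x))       # stage-3 features: richer spatial detail
        f4 = self.stage4(f3)                 # stage-4 features: stronger semantics
        b = self.fam(f3, f4)                 # fused feature map B
        ctx = self.pcpm(b)                   # compressed, context-aggregated features
        g = self.transformer(ctx)            # global pixel relationships
        # The paper fuses the branches by matrix multiplication (Fig. 1); a simple
        # classifier on the globally enriched features is used here as a placeholder.
        logits = self.classifier(g)
        return F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)
```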
Feature alignment module (FAM)
Empirical research14,48,49 has demonstrated that merging multi-scale features can enhance semantic segmentation, since objects of different sizes appear in a single image. The feature maps at each stage of the ResNet101 backbone contain multi-scale features with varying resolutions: low-resolution feature maps carry more semantic information, whereas high-resolution feature maps carry more spatial detail. Furthermore, because of the limits of the receptive field, large-scale objects have poor semantic information after repeated downsampling, while small-scale objects have less explicit location information.
As shown in Fig. 2, we fuse the feature maps of the high-level and low-level stages in order to extract their information appropriately. Depthwise-separable convolution and average pooling encode the feature map information into a channel attention map. After average pooling and a sigmoid, the third-stage feature maps are weighted and fused with the up-sampled fourth-stage feature maps; this assigns distinct weights according to the relevance of different parts of the feature maps and extracts additional information from them. Average pooling also helps prevent overfitting, decreases the model's complexity and computation, and enhances its capacity for generalization. Deformable convolution is then utilized to obtain the FAM output, which is subsequently fused with the weighted third-stage features. The benefits of deformable convolution are mainly the following: 1) In a deformable convolutional network, the receptive field of the convolution can take on arbitrary shapes driven by the input data instead of remaining a fixed square. This allows the model to capture target features more precisely; in particular, it can improve detection accuracy and better emphasize target objects in tasks involving many targets and complex backgrounds. 2) The sampling field of view of deformable convolution is larger than that of normal convolution, which means that a wider variety of elements influence the convolution operation, improving the model's comprehension and use of contextual information. 3) Deformable convolution provides good flexibility and generalizability, since its operation is dynamically adjusted based on the input data; this broadens the model's applicability and helps it adapt more effectively to a range of tasks and contexts. The following equation illustrates the module's entire process:
where α represents the weight parameters generated by the third stage in the upper part of Fig. 2, F1 represents the re-weighted third-stage feature map, and Fout represents the final output of the FAM module.
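For readers who prefer code, the sketch below is one possible PyTorch reading of the FAM description; the channel widths, the exact fusion order, and the offset branch feeding the deformable convolution are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class FAMSketch(nn.Module):
    """One possible reading of the FAM description (channel counts are assumptions)."""
    def __init__(self, c3=1024, c4=2048, out_ch=512):
        super().__init__()
        # Depthwise-separable convolution encodes the stage-3 map before channel attention.
        self.dw = nn.Sequential(nn.Conv2d(c3, c3, 3, padding=1, groups=c3),
                                nn.Conv2d(c3, c3, 1))
        self.reduce3 = nn.Conv2d(c3, out_ch, 1)
        self.reduce4 = nn.Conv2d(c4, out_ch, 1)
        # Offsets for the deformable convolution are predicted from the fused map.
        self.offset = nn.Conv2d(out_ch, 2 * 3 * 3, 3, padding=1)
        self.dcn = DeformConv2d(out_ch, out_ch, 3, padding=1)

    def forward(self, f3, f4):
        # Channel attention weights (alpha): depthwise-separable conv -> global average pool -> sigmoid.
        alpha = torch.sigmoid(F.adaptive_avg_pool2d(self.dw(f3), 1))
        f1 = self.reduce3(f3 * alpha)                                 # re-weighted stage-3 map F1
        f4_up = F.interpolate(self.reduce4(f4), size=f3.shape[-2:],
                              mode="bilinear", align_corners=False)   # up-sampled stage-4 map
        fused = f1 + f4_up                                            # weighted fusion of the two stages
        return self.dcn(fused, self.offset(fused)) + f1               # deformable conv, fused with F1
```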
Transformer structure
We describe the Transformer structure in detail in this section. It consists of Patch Embedding, Multi-Head Attention25, and an MLP, as seen in Fig. 3. The Transformer has proven to be more effective than the CNN at feature extraction for classification, which is crucial for semantic segmentation. It has been applied to visual tasks in a number of ways28,29 and has also shown outstanding performance in semantic segmentation experiments. The Transformer can dynamically compute the relational weights between global pixels and adapt to various input images, making it a very effective tool for capturing global information, which is particularly beneficial when extracting sophisticated semantic information.
Our transformer structure captures both global and spatial detail information about the feature map. It is divided into three primary parts: 1) To convert the image into a patch embedding, it first passes through Patch Embedding. The input image is divided into smaller patches, which are then processed by convolutional and normalization layers. The convolutional layer first transforms the input into embedding vectors, from which the height and width are recovered; the embedding vectors are then flattened and transposed, normalized, and finally returned together with the image's height and width. 2) The feature map then passes through the self-attention layer. The shape of the input feature matrix is first obtained and decomposed into the batch size and sequence length. The dimensions of the input features and the number of attention heads are then used to compute the query vector (q) and the key-value vector (kv), after which the attention weights are calculated and adjusted, and an FFN is used to produce the output feature matrix. The FFN consists of two linear transformations and an activation function. 3) Lastly, the feature map enters the multilayer perceptron (MLP), where the input passes through a fully connected layer, a DWConv layer, and an activation function, and then through another fully connected layer. It is worth noting that skip connections are added in steps 1) and 2), which helps to improve network optimization and addresses the issue of information decay in the network. This stage can be represented by the following equation:
where Atten stands for the self-attention layer; Q, K, and V stand for the query, key, and value respectively, which all have dimension N × C with N = H × W; dhead stands for the dimension of each attention head; F stands for the feature map after patch embedding; and Foutput stands for the final output of the transformer structure.
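The block below is a compact PyTorch sketch of such a transformer block built from patch embedding, multi-head self-attention, and a DWConv MLP with skip connections; it uses the standard scaled dot-product attention in place of the paper's exact q/kv formulation, and the dimensions, head count, and patch size are placeholders rather than the paper's settings.

```python
import torch
import torch.nn as nn

class TransformerBlockSketch(nn.Module):
    """Patch embedding + self-attention + DWConv-MLP block, roughly as described in the text."""
    def __init__(self, in_ch=512, dim=256, heads=8, patch=2):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)  # patch embedding
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)      # scaled dot-product attention
        self.norm2 = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, dim * 4)
        self.dwconv = nn.Conv2d(dim * 4, dim * 4, 3, padding=1, groups=dim * 4)
        self.fc2 = nn.Linear(dim * 4, dim)

    def forward(self, x):
        bsz = x.shape[0]
        x = self.embed(x)                           # B x dim x H' x W'
        h, w = x.shape[-2:]
        seq = x.flatten(2).transpose(1, 2)          # B x N x dim, with N = H' * W'
        n = self.norm1(seq)
        attn_out, _ = self.attn(n, n, n)            # multi-head self-attention
        seq = seq + attn_out                        # skip connection
        m = self.fc1(self.norm2(seq))               # MLP: linear ...
        m = self.dwconv(m.transpose(1, 2).reshape(bsz, -1, h, w))   # ... depthwise conv ...
        m = torch.relu(m.flatten(2).transpose(1, 2))                # ... activation ...
        seq = seq + self.fc2(m)                     # ... linear, plus skip connection
        return seq.transpose(1, 2).reshape(bsz, -1, h, w)           # back to B x dim x H' x W'
```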
Pyramid convolutional pooling module (PCPM)
The drawback of the transformer is that it complicates the computation between features and therefore requires a sufficiently large amount of data to support it. The transformer can extract global information from the feature map, which is very useful for extracting advanced semantic information, but it is not sensitive to the detail information of the image, especially for multi-scale targets and small targets at long distances. Because the transformer produced unsatisfactory segmentation results when used alone on these data, we built a pyramid convolutional pooling module (PCPM) that uses the strengths of the CNN to compensate for the transformer's shortcomings. Through convolutional operations, high-dimensional image data can be converted into low-dimensional feature representations, reducing the amount of computation required. In addition, the convolutional layers can fuse features from different levels to generate a feature map with more representational power. In semantic segmentation tasks, this dimensionality reduction and feature fusion further reduces the computational burden of the Transformer. Convolution and the Transformer can also be assigned different features to handle separately, so the hybrid architecture avoids some redundant computation compared with using either one alone.
Figure 4 illustrates the general layout, which includes a basic context module. This module's primary goal is to fuse the input features and aggregate the global context using pyramid pooling in order to extract information from the input feature maps. The pyramid pooling module contains three adaptive pooling operations, with pooling output sizes set to 1, 3, and 5. This produces richer feature map information, enabling the spatial information of each channel's feature map to be extracted efficiently; in addition, the spatial information on each channel can be linked to other channels to facilitate information exchange. The pooled features then go through upsampling and a 1 × 1 convolution. Finally, we sum these upsampled features and apply a convolution operation to create finer features. As a result, the PCPM module can effectively extract multi-scale features and lessen pixel independence across channels, enabling each pixel to be connected to its neighbors. The process can be illustrated by the following equation:
where Avgpooli represents the adaptive pooling at stage i, Fin represents the input feature map of the PCPM, and F represents the output feature map of the PCPM.
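A minimal PyTorch sketch of a module along these lines is shown below; the channel sizes and the choice of a 3 × 3 refinement convolution are assumptions, while the pooling bins of 1, 3, and 5 follow the text.

```python
import torch.nn as nn
import torch.nn.functional as F

class PCPMSketch(nn.Module):
    """Pyramid pooling over bins of size 1, 3 and 5, as described (channel sizes assumed)."""
    def __init__(self, in_ch=512, out_ch=256, bins=(1, 3, 5)):
        super().__init__()
        self.bins = bins
        self.branches = nn.ModuleList(nn.Conv2d(in_ch, out_ch, 1) for _ in bins)  # 1x1 conv per branch
        self.fuse = nn.Conv2d(out_ch, out_ch, 3, padding=1)                       # final refinement conv

    def forward(self, x):
        h, w = x.shape[-2:]
        out = 0
        for bin_size, conv in zip(self.bins, self.branches):
            p = F.adaptive_avg_pool2d(x, bin_size)                    # aggregate context at this scale
            p = F.interpolate(conv(p), size=(h, w),
                              mode="bilinear", align_corners=False)   # upsample back to the input size
            out = out + p                                             # sum the pyramid branches
        return self.fuse(out)
```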
Experiments
Datasets
In this section, we evaluate our method on a number of mainstream datasets, including PASCAL VOC 201250, Cityscapes51, and ADE20k52.
PASCAL VOC 2012 is a scene dataset comprising 20 categories. It contains 2913 annotated images in total, of which 1464 are used for training and 1449 for validation, while 1456 images are used for testing.
Cityscapes is a road-scene dataset containing 5000 high-quality, pixel-level annotations of urban driving scenes divided into 30 categories; of these images, 2975 are used for training, 500 for validation, and 1525 for testing. Street scenes from fifty cities are represented. Furthermore, Cityscapes provides 19,998 coarsely labeled images; only the finely annotated images, covering 19 categories, are used here.
The ADE20k dataset is a large scene-understanding dataset released by MIT in 2016 that contains over 20,000 scene-centered images and 150 semantic labels for different categories. This dataset covers a wide range of object categories such as sky, road, grass, and people, cars, beds, etc., and includes images and annotations of various objects and parts in different contexts.
Implementation details
To avoid overfitting, we train our model with the AdamW53 optimizer and set the weight decay factor to 0.01. We apply a 'poly' learning rate decay schedule, which multiplies the initial learning rate by \({\left(1-\frac{\text{iter}}{\text{max\_iter}}\right)}^{\text{power}}\), with the initial learning rate set to 6 × 10–5 and the power set to 0.9. To improve the model fit at the start of training, we also use a warm-up54 technique, which decreases oscillation and instability, speeds up convergence, and lets the model respond more smoothly to the training data in the early iterations. Our backbone network is ResNet101, pre-trained on the ImageNet55 dataset. For training and validation on the Cityscapes dataset, we set the batch size to 4, the number of iterations to 160,000, and crop the original images to 1024 × 512. For the PASCAL VOC 2012 and ADE20K datasets, we set the batch size to 8, the number of iterations to 100,000 and 200,000, respectively, and crop the images to 512 × 512 during training and validation.
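As a concrete illustration, the snippet below implements such a poly schedule with a linear warm-up in PyTorch; the warm-up length and the stand-in model are assumptions, since the paper does not report them, while the learning rate, weight decay, power, and Cityscapes iteration count follow the settings above.

```python
import torch

# Poly learning-rate decay with a linear warm-up (warmup_iters is an assumed value).
model = torch.nn.Linear(8, 8)   # stand-in for FFTNet
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5, weight_decay=0.01)

max_iter, power, warmup_iters = 160_000, 0.9, 1_500

def lr_lambda(it: int) -> float:
    if it < warmup_iters:                       # linear warm-up at the start of training
        return (it + 1) / warmup_iters
    return (1 - it / max_iter) ** power         # poly decay afterwards

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
for it in range(10):                            # in practice, loop for max_iter iterations
    # forward pass, loss computation and loss.backward() would go here
    optimizer.step()
    scheduler.step()
```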
Evaluation metrics
In this paper, we use Pixel Accuracy (PA), Intersection over Union (IoU), and mean Intersection over Union (mIoU) as our evaluation metrics. Their formulas are as follows:
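(Restated here in the standard form consistent with the notation explained below, assuming k + 1 classes indexed from 0 to k.)

\(PA=\frac{\sum_{i=0}^{k}{p}_{ii}}{\sum_{i=0}^{k}\sum_{j=0}^{k}{p}_{ij}}\), \(Io{U}_{i}=\frac{{p}_{ii}}{\sum_{j=0}^{k}{p}_{ij}+\sum_{j=0}^{k}{p}_{ji}-{p}_{ii}}\), \(mIoU=\frac{1}{k+1}\sum_{i=0}^{k}Io{U}_{i}\)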
where PA is the ratio of correctly predicted pixels to the total number of pixels, IoU is the ratio of the intersection to the union of the ground-truth and predicted regions computed for each class, and mIoU is the average of the IoU over all classes. k stands for the number of segmentation classes, and pij denotes the number of pixels that belong to class i but are predicted as class j. As a result, pij and pji stand for false positives and false negatives, respectively, and pii is the number of correctly predicted pixels.
Ablation study
Here, we conduct ablation tests to confirm that our approach works as intended. Unless otherwise noted, ResNet101 is chosen as the backbone network for all ablation experiments. We have examined segmentation results on the Cityscapes and Pascal VOC 2012 datasets, and we have confirmed the validity of each FFTNet network module on the Cityscapes dataset.
Feature alignment module (FAM)
We first conduct an ablation study on the internal structure of the FAM. As shown in Table 1, we experiment with the weights α and the deformable convolution in the FAM. Adding the weighting to the third stage increases the mIoU by 0.5% compared with omitting it, which shows that sigmoid weighting can control the magnitude of the weight updates during back-propagation, helping to avoid vanishing gradients and increasing the effectiveness of the network. Table 2 shows that using deformable convolution yields a 0.3% higher mIoU than ordinary convolution. Deformable convolution is an enhanced convolution with an irregularly shaped receptive field and a larger sampling field of view than its standard counterpart. Unlike classic convolution, which can only extract features from a fixed N×N grid, it can extract the features we seek more precisely, and thus further improves segmentation performance.
Pyramid convolutional pooling module (PCPM)
In this section, we perform ablation experiments on the number of pooling layers and their values in PCPM. The model can adapt to varying input sizes and shapes more effectively thanks to the AdaptiveAvgPool layer. Through the configuration of numerous AdaptiveAvgPool layers of varying sizes, the model’s performance can be guaranteed while simultaneously enhancing the model’s efficiency and generalization. As shown in Table 3, we have experimental results with the pooling layer size settings in PCPM set to (1, 3), (1, 3, 5), and (1, 3, 5, 7), respectively.
FFTNet
In this section, we add and remove the above three modules to observe their overall effect; the experimental results are displayed in Table 4. Table 5 presents a comparison with several other state-of-the-art techniques. To confirm the viability of our model, we also varied the datasets and backbone networks in Table 6; with MobileNetV3 as the backbone, the number of network parameters is 4.2M, which shows that our proposed model is sufficiently lightweight. Table 4 demonstrates the critical importance of each of our model's three components, the FAM, the Transformer, and the PCPM, to the final experimental outcomes. Lastly, the PASCAL VOC 2012 and Cityscapes datasets were used for visualization experiments, the results of which are displayed in Figs. 5 and 6.
Figure 5 displays the segmentation results on the Pascal VOC 2012 dataset. Other methods segment the edges of the wine bottle in the first row unclearly, whereas our model segments them accurately. In their results, a human leg is segmented as a horse in the second row, the center section of a cat is separated into a distinct class in the third row, and a cow is segmented as a horse in the fourth row. As demonstrated in the fourth column, our model resolves this so-called intra-class inconsistency and does not divide the same object into multiple classes.
Figure 6 displays the segmentation results on the Cityscapes dataset. Other approaches segment the wheels in the first row into multiple classes, while our network correctly assigns the wheels and the bicycle to the same class. Although the bicycle head in the second row contains few texture cues and is challenging to segment, our approach succeeds in doing so, demonstrating that our model can obtain global information and segments both close and distant objects effectively. Similarly, for the trucks and cars in the third and fourth rows, which are partially assigned to other classes by competing methods, our model, which improves feature differentiation and pixel-region representation, produces clearly better results.
Conclusion
This paper proposes a semantic segmentation feature fusion network based on the transformer (FFTNet), which consists of four main parts: (1) the backbone network, which extracts image features and transforms the original input image into multi-level feature maps for subsequent tasks; (2) the Feature Alignment Module (FAM), which not only enriches spatial details but also enhances the channel representation and thus the representation at each location; (3) the Transformer structure, which extracts global information, creates global pixel relationships, and improves pixel representation; and (4) the PCPM, which enriches and compresses the feature map to lessen the transformer's computational burden while determining the global relationships between pixels. We conducted extensive experiments on three scene segmentation datasets (Cityscapes, Pascal VOC 2012, and ADE20k) to assess the effectiveness of the proposed model. The findings show significant performance improvements, indicating the viability of our proposed model.
Data availability
The raw data utilized in this study are available upon request to the corresponding author.
References
Lin, T.Y., Dollár, P., Girshick, R. et al. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition 2117–2125 (2017).
Chen, Y. et al. Research on image inpainting algorithm of improved total variation minimization method. J. Ambient Intell. Humaniz. Comput. https://doi.org/10.1007/s12652-020-02778-2 (2021).
Long, J., Shelhamer, E., Darrell, T., Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition 3431–3440 (2015).
Chen, Y. et al. Image super-resolution reconstruction based on feature map attention mechanism. Appl. Intell. 51, 4367–4380 (2021).
Zhou, B., Zhao, H., Puig, X. et al. Scene parsing through ADE20K dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 5122–5130 (IEEE, Honolulu, HI, 2017).
Li, Y., Guo, Y., Kao, Y. & He, R. Image piece learning for weakly supervised semantic segmentation. IEEE Trans. Syst. Man Cybernet. Syst. 47(4), 648–659. https://doi.org/10.1109/TSMC.2016.2623683 (2016).
Teichmann, M., Weber, M., Zollner, M. et al. MultiNet: Real-time Joint Semantic Reasoning for Autonomous Driving. In 2018 IEEE Intelligent Vehicles Symposium (IV) 1013–1020 IEEE, Changshu (2018).
Alhaija, H.A., Mustikovela, S.K., Mescheder, L. et al. Augmented reality meets deep learning for car instance segmentation in urban scenes. In British machine vision conference, 1 2 (2017)
Zhou, Z., Rahman Siddiquee, M. M., Tajbakhsh, N. & Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support (eds Stoyanov, D., Taylor, Z., Carneiro, G. et al.) 3–11 (Springer, 2018).
Lecun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc IEEE 86, 2278–2324. https://doi.org/10.1109/5.726791 (1998).
Siam, M., Elkerdawy, S., Jagersand, M., Yogamani, S. Deep semantic segmentation for automated driving: Taxonomy, roadmap and challenges. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). 1–8 IEEE, Yokohama (2017).
Badrinarayanan, V., Handa, A., Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv preprint arXiv:1505.07293 (2015).
Noh, H., Hong, S., Han, B. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision 1520–1528 (2015). https://doi.org/10.1109/ICCV.2015.178
Chen, L. C. et al. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2017).
Chen, L.C., Papandreou, G., Schroff, F. et al. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017).
Chen, L.C., Zhu, Y., Papandreou, G. et al. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV) 801–818 (2018).
Zhao, H., Shi, J., Qi, X. et al. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition 2881–2890 (2017).
Zhou, Z., Zhou, Y., Wang, D., Mu, J. & Zhou, H. Self-attention feature fusion network for semantic segmentation. Neurocomputing. 453, 50–59. https://doi.org/10.1016/j.neucom.2021.04.106 (2021).
Zhao, H., Zhang, Y., Liu, S. et al. Psanet: Point-wise spatial attention network for scene parsing. In Proceedings of the European conference on computer vision (ECCV) 267–283 (2018).
Zhu, Z., Xu, M., Bai, S., Huang, T., Bai, X. Asymmetric non-local neural networks for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 593–602 (2019) https://doi.org/10.1109/ICCV.2019.00068
Fu, J., Liu, J., Tian, H. et al. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 3146–3154 (2019). https://doi.org/10.1109/CVPR.2019.00326
Yuan, Y., Chen, X., Wang, J. Object-contextual representations for semantic segmentation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16. 173-190 Springer International Publishing (2020).
Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N. Learning a discriminative feature network for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition.1857–1866 (2018). https://doi.org/10.1109/CVPR.2018.00199
Lin, F., Wu, T., Wu, S., Tian S, Guo G. Feature selective transformer for semantic image segmentation, arXiv Preprint arXiv:2203.14124 2022. https://doi.org/10.48550/arXiv.2203.14124
Vaswani, A., Shazeer, N., Parmar, N. et al. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 5998–6008 (2017).
Dosovitskiy, A., Beyer, L., Kolesnikov, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
Zheng, S., Lu, J., Zhao, H. et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition 6881–6890 (2021).
Li, K. et al. Uniformer: Unifying convolution and self-attention for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 45(10), 12581–12600 (2023).
Liu, Q., Dong, Y. & Li, X. Multi-stage context refinement network for semantic segmentation. Neurocomputing 535, 53–63. https://doi.org/10.1016/j.neucom.2023.03.006 (2023).
Zhou, Q. et al. Contextual ensemble network for semantic segmentation. Pattern Recogn. 122, 108290. https://doi.org/10.1016/j.patcog.2021.108290 (2022).
Szegedy, C., Liu, W., Jia, Y., Sermanet, P. et al. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 1–9 (2015).
Wang, J. et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 43(10), 3349–3364 (2020).
Zhang, X. et al. Pyramid geometric consistency learning for semantic segmentation. Pattern Recogn. 133, 109020 (2023).
Lin, G., Milan, A., Shen, C. et al. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition 1925–1934 (2017).
Peng, C., Zhang, X., Yu, G. et al. Large kernel matters--improve semantic segmentation by global convolutional network. In Proceedings of the IEEE conference on computer vision and pattern recognition 4353–4361 (2017).
Peng, Z., Huang, W., Gu, S. et al. Conformer: Local features coupling global representations for visual recognition. In Proceedings of the IEEE/CVF international conference on computer vision. 367–376 (2021).
Li, T. et al. Refined division features based on transformer for semantic image segmentation. Int. J. Intell. Syst. https://doi.org/10.1155/2023/6358162 (2023).
Chen, Y., Dai, X., Chen, D. et al. Mobile-former: Bridging mobilenet and transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition 5270–5279 (2022).
Yang, Q. et al. Attention mechanism and adaptive convolution actuated fusion network for next POI recommendation. Int. J. Intell. Syst. 37(10), 7888–7908 (2022).
Chen, J., Lu, Y., Yu, Q. et al. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021).
Wang, W., Chen, C., Ding, M. et al. TransBTS: Multimodal brain tumor segmentation using transformer. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention 109–119 (2021). https://doi.org/10.1007/978-3-030-87193-2_11
Ronneberger, O., Fischer, P., Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention 234–241 (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Li, X., Ma, S., Tang, J. et al. TranSiam: Fusing multimodal visual features using transformer for medical image segmentation. arXiv preprint arXiv:2204.12185 (2022).
Dai, Z. et al. Coatnet: Marrying convolution and attention for all data sizes. Adv. Neural Inf. Process. Syst. 34, 3965–3977 (2021).
Kirillov, A. et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2023).
Ravi, N. et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024).
He, K., Zhang, X., Ren, S. et al. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition 770–778 (2016).
Zhang, H., Dana, K., Shi, J. et al. Context encoding for semantic segmentation. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 7151–7160 (2018).
Liang-Chieh, C., Papandreou, G., Kokkinos, I. et al. Semantic image segmentation with deep convolutional nets and fully connected crfs. In International conference on learning representations (2015).
Everingham, M. & Winn, J. The PASCAL visual object classes challenge 2011 (VOC2011) development kit. Pattern Analysis, Statistical Modelling and Computational Learning, Tech. Rep. 8 (2011).
Cordts, M., Omran, M., Ramos, S. et al. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 3213–3223 (2016). https://doi.org/10.1109/CVPR.2016.350
Zhou, B., Zhao, H., Puig, X. et al. Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition 633–641 (2017).
Loshchilov, I., Hutter, F. Fixing weight decay regularization in Adam (2018).
Sutskever, I., Martens, J., Dahl, G. et al. On the importance of initialization and momentum in deep learning. In International conference on machine learning. PMLR, 1139–1147 (2013).
Deng, J. et al. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009). https://doi.org/10.1109/CVPR.2009.5206848
Wang, H., Zhu, Y., Green, B. et al. Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV. 108–126 Cham: Springer International Publishing, (2020) https://doi.org/10.1007/978-3-030-58548-8_7
Li, X., You, A., Zhu, Z. et al. Semantic flow for fast and accurate scene parsing. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16 775-793 Springer International Publishing (2020). https://doi.org/10.1007/978-3-030-58452-8_45
Choe, S. et al. Open-set domain adaptation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024).
Zhang, B. et al. Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024).
Acknowledgements
The authors gratefully acknowledge the financial support from the National Natural Science Foundation of China (Grant No. 61472220, 61572286).
Funding
National Natural Science Foundation of China, Grant/Award Numbers: 61472220, 61572286.
Author information
Contributions
Tianping Li and Zhaotong Cui wrote the main manuscript text. Zhaotong Cui completed the experiments. Hua Zhang was responsible for oversight. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Informed consent
All information or images that could lead to the identification of research participants come from publicly available datasets for which informed consent was obtained.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Li, T., Cui, Z. & Zhang, H. Semantic segmentation feature fusion network based on transformer. Sci Rep 15, 6110 (2025). https://doi.org/10.1038/s41598-025-90518-x