Abstract
Transformer-CNN architectures have achieved state-of-the-art results in 3D medical image segmentation thanks to their ability to capture both long-range dependencies and local information. However, directly using existing transformers as encoders can be inefficient, particularly for high-resolution 3D medical images, because self-attention computes pixel-to-pixel relationships and is therefore computationally expensive. Attempts to mitigate this cost with local-window attention or axial attention can lose interactions between certain local regions during the self-attention computation. Instead of sparsifying attention, we aim to retain the relationships between all pixels while substantially reducing the computational demand. Inspired by the low-rank property of attention, we hypothesize that pixel-to-pixel relationships can be approximated by a composition of plane-to-plane relationships. We propose the TriAxial Low-Rank Transformer Network (TALoRT-Net) for medical image segmentation. The core of the model is its attention module, which approximates the pixel-to-pixel attention matrix with a low-rank representation formed from the product of plane-to-plane matrices, significantly reducing the computational complexity inherent in 3D self-attention. Moreover, we replace the linear projection and vanilla Multi-Layer Perceptron (MLP) of the Vision Transformer with a convolutional stem and a depthwise convolution layer (DCL) to further reduce the number of model parameters. We evaluated the method on the public BTCV dataset, where it significantly reduces computational cost while maintaining uncompromised accuracy.
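To make the factorization concrete, the following is a minimal PyTorch sketch of the idea the abstract describes. The module names (`TriAxialLowRankAttention`, `DepthwiseConvLayer`), the mean-pooling used to form plane tokens, and the single-head layout are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of triaxial low-rank attention: three plane-to-plane attention
# matrices (one per spatial axis) approximate the full pixel-to-pixel
# attention of a 3D volume. Illustrative only; not the paper's code.
import torch
import torch.nn as nn


class TriAxialLowRankAttention(nn.Module):
    """Approximates 3D self-attention with plane-to-plane matrices
    A_d (DxD), A_h (HxH), A_w (WxW), applied as mode products along
    each axis instead of one (DHW)x(DHW) pixel-to-pixel matrix."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def _plane_attn(self, x, axis):
        # Collapse the two other spatial axes so that each plane along
        # `axis` becomes a single token; x is (B, D, H, W, C).
        dims = [d for d in (1, 2, 3) if d != axis]
        tokens = x.mean(dim=dims)                      # (B, N_axis, C)
        q, k = self.q(tokens), self.k(tokens)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, N_axis, N_axis)
        return attn.softmax(dim=-1)

    def forward(self, x):
        # x: (B, D, H, W, C) channel-last volume of feature vectors.
        a_d = self._plane_attn(x, axis=1)  # (B, D, D)
        a_h = self._plane_attn(x, axis=2)  # (B, H, H)
        a_w = self._plane_attn(x, axis=3)  # (B, W, W)
        v = self.v(x)
        # Apply the factored attention as three sequential mode products;
        # the composed operator plays the role of the full attention map.
        v = torch.einsum('bde,behwc->bdhwc', a_d, v)  # mix along depth
        v = torch.einsum('bhe,bdewc->bdhwc', a_h, v)  # mix along height
        v = torch.einsum('bwe,bdhec->bdhwc', a_w, v)  # mix along width
        return v


class DepthwiseConvLayer(nn.Module):
    """Sketch of a DCL standing in for the transformer MLP: a depthwise
    3D convolution plus pointwise mixing, using far fewer parameters
    than a dense two-layer MLP."""

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.dw = nn.Conv3d(dim, dim, kernel_size,
                            padding=kernel_size // 2, groups=dim)
        self.act = nn.GELU()
        self.pw = nn.Conv3d(dim, dim, 1)  # 1x1x1 channel mixing

    def forward(self, x):
        # x: (B, C, D, H, W) channel-first volume.
        return self.pw(self.act(self.dw(x)))
```

The savings come from the size of the attention matrices: the three plane-to-plane matrices are only D×D, H×H, and W×W, rather than the single (DHW)×(DHW) pixel-to-pixel matrix of vanilla 3D self-attention, while their composition still relates every pixel to every other pixel.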