A Symmetric Efficient Spatial and Channel Attention (ESCA) Module Based on Convolutional Neural Networks
Figure 1. Symmetric structure of the ESCA module. The module contains four attention sub-modules, and the input features are refined in the order Channel–Spatial (Height–Width)–Channel attention.
Figure 2. Detailed diagram of the channel attention module. The module applies GAP to the input features, performs a 1D convolution with kernel size *k* to obtain the weight map, and multiplies that map with the input features to produce the weighted features.
Figure 3. Detailed structure of ResNet-50, ESCA-ResNet-50, and ESCA-ResNet-101.
Figure 4. Grad-CAM visualizations. Applying Grad-CAM to the last layer of YOLOv8-cls networks equipped with different attention modules shows that our ESCA module attends more closely to the key features of the target.
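The captions above fully determine only the channel branch (GAP, a 1D convolution with kernel size *k*, sigmoid rescaling) and the symmetric ordering of the four sub-modules. The PyTorch sketch below is one plausible reading of that description; the spatial-branch pooling, the absence of weight sharing between the two channel branches, and the default kernel sizes (taken from the best entry of the kernel-size table later in the paper) are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """ECA-style channel attention: GAP -> 1D conv -> sigmoid -> rescale."""
    def __init__(self, k: int = 5):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                              # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                         # GAP -> (B, C)
        y = self.conv(y.unsqueeze(1))                  # 1D conv across channels -> (B, 1, C)
        w = torch.sigmoid(y).transpose(1, 2).unsqueeze(-1)  # weight map (B, C, 1, 1)
        return x * w

class SpatialAttention(nn.Module):
    """1D attention along one spatial axis (dim=2 for height, dim=3 for width).
    Assumption: mirrors the channel branch by pooling over the other two axes."""
    def __init__(self, dim: int, k: int = 5):
        super().__init__()
        self.dim = dim
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                              # x: (B, C, H, W)
        pool_dims = [d for d in (1, 2, 3) if d != self.dim]
        y = x.mean(dim=pool_dims)                      # (B, L) with L = H or W
        w = torch.sigmoid(self.conv(y.unsqueeze(1))).squeeze(1)  # (B, L)
        shape = [x.size(0), 1, 1, 1]
        shape[self.dim] = x.size(self.dim)             # broadcastable weight shape
        return x * w.view(*shape)

class ESCA(nn.Module):
    """Symmetric order from Figure 1: Channel -> Spatial(H) -> Spatial(W) -> Channel."""
    def __init__(self, k_c: int = 5, k_s: int = 5):
        super().__init__()
        self.ca_in = ChannelAttention(k_c)
        self.sa_h = SpatialAttention(dim=2, k=k_s)
        self.sa_w = SpatialAttention(dim=3, k=k_s)
        self.ca_out = ChannelAttention(k_c)

    def forward(self, x):
        return self.ca_out(self.sa_w(self.sa_h(self.ca_in(x))))
```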
Abstract
1. Introduction
- We propose an optimized, effective attention module that enhances the model’s generalization ability.
- We verify the negative effect of GMP and the validity of our symmetric module design through extensive ablation studies.
- We validate that embedding ESCA into different network architectures (ResNet, MobileNet, YOLO) brings substantial gains on both classification and detection benchmarks (Mini ImageNet, CIFAR-10, and VOC 2007); a sketch of such an embedding follows this list.
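As a concrete example of the embedding mentioned in the last bullet, the module can be dropped into a ResNet-50 bottleneck before the residual addition, the position where SE and ECA blocks are conventionally placed. This is a minimal sketch under that assumption (the excerpt names the backbones but not the exact insertion point); it reuses the `ESCA` class sketched earlier.

```python
import torch.nn as nn

class ESCABottleneck(nn.Module):
    """ResNet-50 bottleneck with ESCA applied to the residual branch output.
    Placement before the skip-connection add is an assumption."""
    expansion = 4

    def __init__(self, in_ch, mid_ch, stride=1, downsample=None):
        super().__init__()
        out_ch = mid_ch * self.expansion
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.esca = ESCA()                 # attention on the residual branch
        self.downsample = downsample       # projects identity when shapes differ
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = self.downsample(x) if self.downsample is not None else x
        out = self.esca(self.body(x))
        return self.relu(out + identity)
```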
2. Related Work
3. Efficient Spatial and Channel Attention
3.1. Review of SE, CBAM, CA, and ECA Modules
3.2. Efficient Spatial and Channel Attention (ESCA) Module
3.3. Discussion
4. Experiments
4.1. Experiment Preparation
4.2. Image Classification on Mini ImageNet
4.2.1. Effect of GAP and GMP on the ESCA Module
4.2.2. Impact of 1D Kernel Size on ESCA
4.2.3. Contrasts Using Different Networks
4.2.4. Image Classification on CIFAR-10
4.2.5. Object Detection on VOC 2007
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T.S. SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6298–6306.
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520.
- Guo, Y.; Cao, X.; Liu, B.; Gao, M. Cloud Detection for Satellite Imagery Using Attention-Based U-Net Convolutional Neural Network. Symmetry 2020, 12, 1056.
- Ayoub, S.; Gulzar, Y.; Reegu, F.A.; Turaev, S. Generating Image Captions Using Bahdanau Attention Mechanism and Transfer Learning. Symmetry 2022, 14, 2681.
- Yang, W.; Yuan, Y.; Zhang, D.; Zheng, L.; Nie, F. An Effective Image Classification Method for Plant Diseases with Improved Channel Attention Mechanism aECAnet Based on Deep Learning. Symmetry 2024, 16, 451.
- Wang, H.; Liu, J.; Tan, H.; Lou, J.; Liu, X.; Zhou, W.; Liu, H. Blind Image Quality Assessment via Adaptive Graph Attention. IEEE Trans. Circuits Syst. Video Technol. 2024.
- Li, Y.; Yang, X.; Fu, J.; Yue, G.; Zhou, W. Deep Bi-directional Attention Network for Image Super-Resolution Quality Assessment. arXiv 2024, arXiv:2403.10406.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018; Proceedings, Part VII; Springer: Berlin/Heidelberg, Germany, 2018; pp. 3–19.
- Li, Y.; Li, X.; Yang, J. Spatial Group-Wise Enhance: Enhancing Semantic Feature Learning in CNN. In Proceedings of the Computer Vision—ACCV 2022: 16th Asian Conference on Computer Vision, Macao, China, 4–8 December 2022; Proceedings, Part V; Springer: Berlin/Heidelberg, Germany, 2023; pp. 316–332.
- Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018; Proceedings, Part XIV; Springer: Berlin/Heidelberg, Germany, 2018; pp. 122–138.
- Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 13708–13717.
- Liu, H.; Liu, F.; Fan, X.; Huang, D. Polarized self-attention: Towards high-quality pixel-wise mapping. Neurocomputing 2022, 506, 158–167.
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 11531–11539.
- Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5987–5995.
- Li, X.; Wang, W.; Hu, X.; Yang, J. Selective Kernel Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 510–519.
- Misra, D.; Nalamada, T.; Arasanipalai, A.U.; Hou, Q. Rotate to Attend: Convolutional Triplet Attention Module. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 3138–3147.
- Zhang, Q.L.; Yang, Y.B. SA-Net: Shuffle Attention for Deep Convolutional Neural Networks. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 6–11 June 2021; pp. 2235–2239.
- Goyal, A.; Bochkovskiy, A.; Deng, J.; Koltun, V. Non-deep Networks. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: New York, NY, USA, 2022; Volume 35, pp. 6789–6801.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Redmon, J.; Divvala, S.K.; Girshick, R.B.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local Neural Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269.
- Li, P.; Xie, J.; Wang, Q.; Zuo, W. Is Second-Order Information Helpful for Large-Scale Visual Recognition? In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2089–2097.
- Li, Y.; Wang, N.; Liu, J.; Hou, X. Factorized Bilinear Models for Image Recognition. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2098–2106.
- Zagoruyko, S.; Komodakis, N. Wide Residual Networks. arXiv 2016, arXiv:1605.07146.
- Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning; AAAI Press: Washington, DC, USA, 2017; pp. 4278–4284.
- Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.C.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R.; et al. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324.
- Gao, Z.; Xie, J.; Wang, Q.; Li, P. Global Second-Order Pooling Convolutional Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 3019–3028.
- Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Vedaldi, A. Gather-excite: Exploiting feature context in convolutional neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Curran Associates Inc.: Red Hook, NY, USA, 2018; pp. 9423–9433.
- Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807.
- Roy, A.G.; Navab, N.; Wachinger, C. Recalibrating Fully Convolutional Networks With Spatial and Channel “Squeeze and Excitation” Blocks. IEEE Trans. Med. Imaging 2019, 38, 540–549.
- Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. GCNet: Non-Local Networks Meet Squeeze-Excitation Networks and Beyond. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1971–1980.
- Chen, Y.; Kalantidis, Y.; Li, J.; Yan, S.; Feng, J. A2-Nets: Double attention networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Curran Associates Inc.: Red Hook, NY, USA, 2018; pp. 350–359.
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 3141–3149.
- Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
| Setting | Image Classification | Object Detection |
|---|---|---|
| Dataset | Mini ImageNet | VOC 2007 |
| Epochs | 24 | 300 |
| Batch size | 256 | 256 |
| Image size | 224 × 224 | 640 × 640 |
| Optimizer | SGD | SGD |
| Initial learning rate | 0.01 | 0.1 |
| Final learning rate | 0.001 | 0.01 |
| Momentum | 0.9 | 0.9 |
| Weight decay | 1 × 10⁻⁴ | 4 × 10⁻⁵ |
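For reference, the classification column of the table above maps onto a standard PyTorch setup as sketched below. The decay curve between the initial and final learning rate is not specified here, so the linear schedule is an assumption.

```python
import torch
from torchvision.models import resnet50

# Classification settings from the table above: SGD, lr 0.01 -> 0.001 over
# 24 epochs, momentum 0.9, weight decay 1e-4. Linear decay is an assumption.
model = resnet50(num_classes=100)  # Mini ImageNet has 100 classes
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.1, total_iters=24)  # 0.01 -> 0.001

for epoch in range(24):
    # ...one training pass over Mini ImageNet (batch size 256, 224x224 inputs)...
    scheduler.step()
```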
| Method | Module | #Param. | Top-1 | Top-5 | FLOPs |
|---|---|---|---|---|---|
| Baseline | ResNet-50 | 26.26 M | 54.81% | 80.79% | 8.533 G |
| +GAP | ResNet-50 | 26.26 M | 61.14% | 84.04% | 8.545 G |
| +GMP | ResNet-50 | 26.26 M | 54.10% | 80.05% | 8.534 G |
| +GAP, GMP | ResNet-50 | 26.26 M | 54.62% | 80.12% | 8.538 G |
| Baseline | YOLOv8-cls | 1.566 M | 46.12% | 75.72% | 0.4241 G |
| +GAP | YOLOv8-cls | 1.566 M | 48.57% | 77.56% | 0.4249 G |
| +GMP | YOLOv8-cls | 1.566 M | 44.73% | 74.56% | 0.4242 G |
| +GAP, GMP | YOLOv8-cls | 1.566 M | 46.02% | 75.58% | 0.4245 G |
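This ablation can be reproduced by making the pooling of the channel branch selectable, as in the sketch below. When both pools are enabled, the two descriptors are summed before the 1D convolution; that fusion rule is an assumption (the excerpt does not state how the paper combines them).

```python
import torch
import torch.nn as nn

class PoolAblationChannelAttention(nn.Module):
    """Channel attention with selectable pooling for the GAP/GMP ablation."""
    def __init__(self, k: int = 5, use_gap: bool = True, use_gmp: bool = False):
        super().__init__()
        assert use_gap or use_gmp, "enable at least one pooling descriptor"
        self.use_gap, self.use_gmp = use_gap, use_gmp
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                          # x: (B, C, H, W)
        y = 0.0
        if self.use_gap:
            y = y + x.mean(dim=(2, 3))             # GAP descriptor (B, C)
        if self.use_gmp:
            y = y + x.amax(dim=(2, 3))             # GMP descriptor (B, C)
        w = torch.sigmoid(self.conv(y.unsqueeze(1)))  # (B, 1, C)
        return x * w.transpose(1, 2).unsqueeze(-1)    # rescale per channel
```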
| Kernel Size | Module | #Param. | Top-1 | Top-5 | FLOPs |
|---|---|---|---|---|---|
| Baseline | YOLOv8-cls | 1.566 M | 46.12% | 75.72% | 0.4241 G |
| K = (3, 3) | YOLOv8-cls | 1.566 M | 47.59% | 76.57% | 0.4248 G |
| K = (3, 5) | YOLOv8-cls | 1.566 M | 47.60% | 76.86% | 0.4249 G |
| K = (3, 7) | YOLOv8-cls | 1.566 M | 47.08% | 76.16% | 0.4250 G |
| K = (5, 3) | YOLOv8-cls | 1.566 M | 46.80% | 76.08% | 0.4248 G |
| K = (5, 5) | YOLOv8-cls | 1.566 M | 48.57% | 77.56% | 0.4249 G |
| K = (5, 7) | YOLOv8-cls | 1.566 M | 48.53% | 77.03% | 0.4250 G |
| K = (7, 3) | YOLOv8-cls | 1.566 M | 47.91% | 76.81% | 0.4248 G |
| K = (7, 5) | YOLOv8-cls | 1.566 M | 46.99% | 76.49% | 0.4249 G |
| K = (7, 7) | YOLOv8-cls | 1.566 M | 46.93% | 75.93% | 0.4250 G |
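Reading K = (k₁, k₂) as the 1D kernel sizes of the channel and spatial branches respectively (an assumption; the pair ordering is not defined in this excerpt), the sweep above amounts to a small grid search over the `ESCA` sketch given earlier:

```python
# Grid search over candidate 1D kernel pairs; K = (5, 5) scored best above.
for k_c in (3, 5, 7):
    for k_s in (3, 5, 7):
        esca = ESCA(k_c=k_c, k_s=k_s)
        # ...embed `esca` in YOLOv8-cls, train, and record Top-1/Top-5...
```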
| Method | Backbone | #Param. | Top-1 | Top-5 | FLOPs | +FLOPs | Time |
|---|---|---|---|---|---|---|---|
| Baseline | ResNet-50 | 26.26 M | 54.81% | 80.79% | 8.533 G | 0 | 12.55 ms |
| +SE | ResNet-50 | 26.43 M | 58.57% | 82.81% | 8.553 G | +0.23% | 13.26 ms |
| +CBAM | ResNet-50 | 27.63 M | 59.38% | 83.33% | 8.672 G | +1.63% | 15.43 ms |
| +CA | ResNet-50 | 26.39 M | 57.07% | 81.83% | 8.585 G | +0.61% | 14.57 ms |
| +ECA | ResNet-50 | 26.26 M | 59.74% | 83.69% | 8.537 G | +0.05% | 12.76 ms |
| +EMA | ResNet-50 | 26.27 M | 54.58% | 80.14% | 8.917 G | +4.50% | 15.08 ms |
| +ESCA | ResNet-50 | 26.26 M | 61.14% | 84.04% | 8.545 G | +0.14% | 13.10 ms |
| Baseline | ResNet-101 | 45.25 M | 57.32% | 81.56% | 15.999 G | 0 | 21.53 ms |
| +SE | ResNet-101 | 45.42 M | 59.73% | 83.40% | 16.019 G | +0.13% | 22.56 ms |
| +CBAM | ResNet-101 | 46.63 M | 59.93% | 83.42% | 16.137 G | +0.86% | 23.16 ms |
| +CA | ResNet-101 | 45.39 M | 58.15% | 82.18% | 16.051 G | +0.33% | 22.89 ms |
| +ECA | ResNet-101 | 45.25 M | 60.65% | 83.69% | 16.002 G | +0.02% | 21.96 ms |
| +EMA | ResNet-101 | 45.27 M | 56.91% | 80.87% | 16.381 G | +2.51% | 22.97 ms |
| +ESCA | ResNet-101 | 45.25 M | 61.48% | 84.03% | 16.010 G | +0.07% | 22.37 ms |
| Baseline | MobileNetV3 | 0.782 M | 39.24% | 68.75% | 122.811 M | 0 | 10.33 ms |
| +SE | MobileNetV3 | 0.782 M | 40.36% | 70.04% | 122.832 M | +0.02% | 10.79 ms |
| +CBAM | MobileNetV3 | 0.783 M | 40.49% | 70.05% | 122.953 M | +0.12% | 11.08 ms |
| +ECA | MobileNetV3 | 0.782 M | 40.48% | 70.08% | 122.836 M | +0.02% | 10.74 ms |
| +EMA | MobileNetV3 | 0.782 M | 39.14% | 68.37% | 123.094 M | +0.23% | 11.01 ms |
| +ESCA | MobileNetV3 | 0.782 M | 41.33% | 70.28% | 122.909 M | +0.08% | 10.76 ms |
| Baseline | YOLOv8-cls | 1.566 M | 46.12% | 75.72% | 0.4241 G | 0 | 6.38 ms |
| +SE | YOLOv8-cls | 1.569 M | 46.99% | 76.20% | 0.4246 G | +0.12% | 6.67 ms |
| +CBAM | YOLOv8-cls | 1.587 M | 46.64% | 76.38% | 0.4266 G | +0.59% | 7.38 ms |
| +CA | YOLOv8-cls | 1.571 M | 47.45% | 76.37% | 0.4263 G | +0.52% | 8.65 ms |
| +ECA | YOLOv8-cls | 1.566 M | 46.73% | 76.69% | 0.4243 G | +0.05% | 7.01 ms |
| +EMA | YOLOv8-cls | 1.567 M | 45.81% | 75.16% | 0.4289 G | +1.13% | 7.90 ms |
| +ESCA | YOLOv8-cls | 1.566 M | 48.57% | 77.56% | 0.4249 G | +0.19% | 7.23 ms |
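The parameter and FLOP columns can be checked with an off-the-shelf profiler. The sketch below uses `thop`; which tool produced the paper's numbers is not stated, and `thop` reports multiply–accumulates, so FLOPs ≈ 2 × MACs is an assumed convention.

```python
import torch
from thop import profile          # pip install thop; a common MAC/param counter
from torchvision.models import resnet50

model = resnet50(num_classes=100)            # Mini ImageNet has 100 classes
x = torch.randn(1, 3, 224, 224)              # input size from the settings table
macs, params = profile(model, inputs=(x,))   # thop returns (MACs, params)
print(f"{params / 1e6:.2f} M params, {2 * macs / 1e9:.3f} GFLOPs")  # FLOPs ~ 2*MACs
```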
| Method | Module | #Param. | Top-1 | Top-5 | FLOPs |
|---|---|---|---|---|---|
| Baseline | ResNet-50 | 26.14 M | 85.49% | 99.32% | 8.522 G |
| +SE | ResNet-50 | 26.32 M | 86.40% | 99.45% | 8.542 G |
| +CBAM | ResNet-50 | 27.52 M | 86.13% | 99.36% | 8.661 G |
| +ECA | ResNet-50 | 26.14 M | 86.39% | 99.44% | 8.526 G |
| +ESCA | ResNet-50 | 26.14 M | 86.43% | 99.47% | 8.534 G |
| Baseline | YOLOv8-cls | 1.451 M | 81.93% | 99.00% | 0.413 G |
| +SE | YOLOv8-cls | 1.454 M | 82.21% | 98.99% | 0.413 G |
| +CBAM | YOLOv8-cls | 1.472 M | 82.27% | 99.22% | 0.415 G |
| +ECA | YOLOv8-cls | 1.451 M | 82.45% | 99.11% | 0.413 G |
| +ESCA | YOLOv8-cls | 1.451 M | 82.52% | 99.22% | 0.414 G |
| Method | Module | #Param. | FLOPs | mAP(0.5) | mAP(0.5:0.95) |
|---|---|---|---|---|---|
| Baseline | ResNet-50 | 49.29 M | 18.62 G | 57.30% | 34.82% |
| +SE | ResNet-50 | 49.38 M | 18.64 G | 57.91% | 35.14% |
| +CBAM | ResNet-50 | 50.29 M | 18.69 G | 57.85% | 35.11% |
| +ECA | ResNet-50 | 49.29 M | 18.62 G | 58.87% | 35.41% |
| +ESCA | ResNet-50 | 49.29 M | 18.63 G | 59.24% | 36.03% |
| Baseline | YOLOv8n | 3.010 M | 0.993 G | 48.60% | 29.32% |
| +SE | YOLOv8n | 3.023 M | 1.007 G | 49.12% | 29.55% |
| +CBAM | YOLOv8n | 3.223 M | 1.092 G | 49.01% | 29.51% |
| +ECA | YOLOv8n | 3.015 M | 1.006 G | 49.56% | 29.91% |
| +ESCA | YOLOv8n | 3.157 M | 1.086 G | 50.27% | 30.04% |