FIE-Net: spatiotemporal full-stage interaction enhancement network for video salient object detection

  • Original Paper
  • Published in: Signal, Image and Video Processing (2024)

Abstract

In video salient object detection, effective fusion of spatiotemporal cues is the key to detecting salient objects successfully. Existing methods suffer from inadequate fusion or from over-reliance on a single source of information, which makes them perform poorly in complex scenes. To address these issues, we propose a new spatiotemporal full-stage interaction enhancement network (FIE-Net) for video salient object detection. FIE-Net applies spatiotemporal interaction deeply throughout the encoder and decoder stages, fully exploiting the complementarity of the spatial and temporal modalities. Specifically, we introduce a progressive attention guidance unit in the encoder, which adaptively fuses spatiotemporal features under a progressive structure for efficient interaction of spatiotemporal information. In the decoder, we incorporate a cross-modal global refinement unit, which uses spatiotemporal global features to refine and complement the encoder features and thereby recover more complete salient information. In addition, we employ a multilevel information correction unit that further filters the input features using spatial low-level features and optical-flow prediction maps, yielding more accurate saliency estimates. We conducted experiments on four benchmark datasets; the results show that our method is highly competitive with current state-of-the-art algorithms.
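
To make the fusion idea concrete, below is a minimal PyTorch sketch of two of the components described above: an attention-guided fusion block in the spirit of the progressive attention guidance unit, and a feature-filtering block in the spirit of the multilevel information correction unit. All class names, channel sizes, and the exact gating formulation are assumptions for illustration only; this is not the authors' released implementation.

```python
# Hedged sketch: plausible building blocks for attention-guided
# spatiotemporal fusion. Names, channel counts, and gating order are
# illustrative assumptions, not the paper's actual code.
import torch
import torch.nn as nn


class AttentionGuidedFusion(nn.Module):
    """Fuse appearance (spatial) and motion (temporal) feature maps with
    channel attention followed by spatial attention."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel gate: decide which fused channels to emphasize.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial gate: decide where in the frame to focus.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, spatial_feat: torch.Tensor,
                motion_feat: torch.Tensor) -> torch.Tensor:
        fused = spatial_feat + motion_feat            # element-wise fusion
        fused = fused * self.channel_gate(fused)      # reweight channels
        avg_map = fused.mean(dim=1, keepdim=True)     # per-pixel mean
        max_map, _ = fused.max(dim=1, keepdim=True)   # per-pixel max
        attn = self.spatial_gate(torch.cat([avg_map, max_map], dim=1))
        return fused * attn                           # reweight locations


class InformationCorrection(nn.Module):
    """Filter features with an optical-flow prediction map (used as a soft
    mask) and low-level spatial detail; an assumed formulation."""

    def __init__(self, channels: int):
        super().__init__()
        self.refine = nn.Conv2d(channels * 2, channels,
                                kernel_size=3, padding=1)

    def forward(self, feat, low_level_feat, flow_pred):
        # flow_pred: (B, 1, H, W) saliency map predicted from optical flow,
        # suppressing responses away from the moving object.
        gated = feat * torch.sigmoid(flow_pred)
        return self.refine(torch.cat([gated, low_level_feat], dim=1))


if __name__ == "__main__":
    rgb = torch.randn(2, 64, 56, 56)    # appearance encoder features
    flow = torch.randn(2, 64, 56, 56)   # motion encoder features
    fused = AttentionGuidedFusion(64)(rgb, flow)
    out = InformationCorrection(64)(fused, rgb, torch.randn(2, 1, 56, 56))
    print(fused.shape, out.shape)       # both torch.Size([2, 64, 56, 56])
```

Stacking such a fusion block at each encoder stage, with each stage's output guiding the next, would yield the kind of progressive structure the abstract describes; the channel-then-spatial ordering follows the familiar pattern of standard attention modules.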


Data availability

The datasets generated during and/or analysed during the current study are not publicly available due to [REASON(S) WHY DATA ARE NOT PUBLIC] but are available from the corresponding author on reasonable request.


Author information

Contributions

All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by Jun Wang, Chenhao Sun, Haoyu Wang, Xing Ren, and Ziqing Huang. The first draft of the manuscript was written by Xiaoli Li, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Xiaoli Li.

Ethics declarations

Conflict of interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled "FIE-Net: Spatiotemporal Full-Stage Interaction Enhancement Network for Video Salient Object Detection".

Ethical and informed consent

The data used did not involve human participants or animal studies.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by the National Natural Science Foundation of China Youth Fund (No. 62202142) and the Scientific Research Key Foundation of Higher Education Institutions of Henan Province (No. 23A520025).

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, J., Sun, C., Wang, H. et al. FIE-Net: spatiotemporal full-stage interaction enhancement network for video salient object detection. SIViP 18, 6321–6337 (2024). https://doi.org/10.1007/s11760-024-03319-6
