Fie-net: spatiotemporal full-stage interaction enhancement network for video salient object detection

Jun Wang¹, Chenhao Sun¹, Haoyu Wang¹, Xing Ren¹, Ziqing Huang¹ & Xiaoli Li¹

Abstract

In video salient object detection, effectively fusing spatiotemporal cues is the key to detecting salient objects. Existing methods fuse these cues inadequately or rely too heavily on a single source of information, so they perform poorly in complex scenes. To address these issues, we propose a new spatiotemporal full-stage interaction enhancement network (FIE-Net) for video salient object detection. FIE-Net carries spatiotemporal information interaction through both the encoder and decoder stages, fully exploring the complementarity of the spatiotemporal modalities. Specifically, in the encoder we introduce a progressive attention guidance unit, which adaptively fuses spatiotemporal features within a progressive structure so that spatiotemporal information interacts efficiently. In the decoder we incorporate a cross-modal global refinement unit, which uses spatiotemporal global features to refine and complement the encoder features and thereby obtain more complete salient information. In addition, we employ a multilevel information correction unit, which further filters the input features using spatial low-level features and optical flow prediction maps to obtain more accurate salient information. We conducted experiments on four benchmark datasets. The experimental results show that our method is highly competitive with current state-of-the-art algorithms.
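The idea of adaptively fusing spatial and temporal features can be illustrated with a small sketch. This is not the progressive attention guidance unit described in the abstract, whose design is not given here. It is only a generic, parameter-free channel gate, and the names `channel_gates` and `gated_fusion`, the layout of features as channels by flattened positions, and the gating rule itself are all assumptions made for illustration.

```python
import math
from typing import List

# Assumed layout: one list per channel, each holding the flattened H*W responses.
Feature = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def channel_gates(spatial: Feature, temporal: Feature) -> List[float]:
    """Return one gate in (0, 1) per channel.

    Hypothetical rule: global-average-pool each modality per channel and
    favour the one with the stronger mean response. A learned unit would
    replace this with trainable layers.
    """
    gates = []
    for s_ch, t_ch in zip(spatial, temporal):
        s_mean = sum(s_ch) / len(s_ch)
        t_mean = sum(t_ch) / len(t_ch)
        gates.append(sigmoid(s_mean - t_mean))
    return gates


def gated_fusion(spatial: Feature, temporal: Feature) -> Feature:
    """Blend the two modalities channel by channel as g*s + (1 - g)*t."""
    fused = []
    for g, s_ch, t_ch in zip(channel_gates(spatial, temporal), spatial, temporal):
        fused.append([g * s + (1.0 - g) * t for s, t in zip(s_ch, t_ch)])
    return fused


if __name__ == "__main__":
    rgb_feat = [[2.0, 2.0], [0.5, 0.1]]   # toy spatial (appearance) features
    flow_feat = [[0.0, 0.0], [0.9, 0.7]]  # toy temporal (optical flow) features
    print(gated_fusion(rgb_feat, flow_feat))
```

Because each gate lies strictly between 0 and 1, every fused value is a convex combination of the two inputs, so neither modality is discarded outright. That is one simple way to avoid over-reliance on a single cue.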

Data availability

The datasets generated during and/or analysed during the current study are not publicly available due to [REASON(S) WHY DATA ARE NOT PUBLIC] but are available from the corresponding author on reasonable request.

Author information

Authors and Affiliations

Henan University, Kaifeng, Henan, China
Jun Wang, Chenhao Sun, Haoyu Wang, Xing Ren, Ziqing Huang & Xiaoli Li

Contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Jun Wang, Chenhao Sun, Haoyu Wang, Xing Ren, and Ziqing Huang. The first draft of the manuscript was written by Xiaoli Li and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Xiaoli Li.

Ethics declarations

Conflict of interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled “FIE-Net: Spatiotemporal Full-stage Interaction Enhancement network for Video Salient Object Detection”.

Ethical and informed consent

The data used did not involve human participants or animal studies.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by the National Natural Science Foundation of China Youth Fund (No. 62202142) and the Scientific Research Key Foundation of Higher Education Institutions of Henan Province (No. 23A520025).

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, J., Sun, C., Wang, H. et al. Fie-net: spatiotemporal full-stage interaction enhancement network for video salient object detection. SIViP 18, 6321–6337 (2024). https://doi.org/10.1007/s11760-024-03319-6

Received: 13 April 2024
Revised: 17 May 2024
Accepted: 25 May 2024
Published: 17 June 2024
Issue Date: September 2024
DOI: https://doi.org/10.1007/s11760-024-03319-6
