Abstract
In video salient object detection, effectively fusing spatiotemporal cues is the key to detecting salient objects. Existing methods fuse these cues inadequately and rely too heavily on a single source of information, so they perform poorly in complex scenes. To address these issues, we propose a new spatiotemporal full-stage interaction enhancement network (FIE-Net) for video salient object detection. FIE-Net applies spatiotemporal information interaction throughout both the encoder and decoder stages, fully exploring the complementarity of the spatial and temporal modalities. Specifically, we introduce a progressive attention guidance unit in the encoder, which adaptively fuses spatiotemporal features under a progressive structure for efficient interaction of spatiotemporal information. In the decoder, we incorporate a cross-modal global refinement unit, which uses spatiotemporal global features to refine and complement the encoder features and obtain more complete salient information. In addition, we employ a multilevel information correction unit that further filters the input features using spatial low-level features and optical flow prediction maps to obtain more accurate salient information. We conducted experiments on four benchmark datasets. The experimental results show that our method is highly competitive with current state-of-the-art algorithms.
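To make the idea of adaptively fusing spatial and temporal features concrete, the following is a minimal sketch of a generic gated fusion, not the progressive attention guidance unit itself, whose design the abstract does not specify. The function names, the list-based feature layout, and the parameter-free sigmoid gate are all assumptions chosen for illustration.

```python
import math
from typing import List

FeatureMap = List[List[float]]  # one channel stored as an H x W grid


def global_average(channel: FeatureMap) -> float:
    """Mean activation of a single channel."""
    values = [v for row in channel for v in row]
    return sum(values) / len(values)


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(spatial: List[FeatureMap],
                 temporal: List[FeatureMap]) -> List[FeatureMap]:
    """Fuse two feature stacks channel by channel.

    For each channel the output is g * spatial + (1 - g) * temporal.
    The gate g is the sigmoid of the difference between the pooled
    spatial and temporal responses. This gate is a hypothetical,
    parameter-free choice; a trained network would learn it instead.
    """
    if len(spatial) != len(temporal):
        raise ValueError("spatial and temporal channel counts differ")
    fused = []
    for s_ch, t_ch in zip(spatial, temporal):
        g = sigmoid(global_average(s_ch) - global_average(t_ch))
        fused.append([
            [g * s + (1.0 - g) * t for s, t in zip(s_row, t_row)]
            for s_row, t_row in zip(s_ch, t_ch)
        ])
    return fused


if __name__ == "__main__":
    rgb_features = [[[2.0, 2.0], [2.0, 2.0]]]    # one 2 x 2 spatial channel
    flow_features = [[[0.0, 0.0], [0.0, 0.0]]]   # one 2 x 2 optical flow channel
    print(gated_fusion(rgb_features, flow_features))
```

In this sketch, a channel whose spatial response is stronger than its temporal response leans toward the spatial input, and the reverse holds when motion cues dominate. When both responses are equal, the gate is 0.5 and the two inputs are averaged.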
Data availability
The datasets generated during and/or analysed during the current study are not publicly available due to [REASON(S) WHY DATA ARE NOT PUBLIC] but are available from the corresponding author on reasonable request.
Author information
Contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Jun Wang, Chenhao Sun, Haoyu Wang, Xing Ren, and Ziqing Huang. The first draft of the manuscript was written by Xiaoli Li and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled “FIE-Net: Spatiotemporal Full-stage Interaction Enhancement network for Video Salient Object Detection”.
Ethical and informed consent
The data used did not involve human participants or animal studies.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This work was supported by the National Natural Science Foundation of China Youth Fund (No. 62202142) and the Scientific Research Key Foundation of Higher Education Institutions of Henan Province (No. 23A520025).
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, J., Sun, C., Wang, H. et al. Fie-net: spatiotemporal full-stage interaction enhancement network for video salient object detection. SIViP 18, 6321–6337 (2024). https://doi.org/10.1007/s11760-024-03319-6
DOI: https://doi.org/10.1007/s11760-024-03319-6