
Coherence-aware context aggregator for fast video object segmentation

Published: 01 April 2023

Highlights

We re-analyze how temporal context is generated and utilized in the semi-supervised VOS task and introduce an efficient tracklet data flow, based on which we propose a novel VOS model that runs in real time. Experimental results on three benchmark datasets show that the proposed model achieves a better trade-off between efficiency and accuracy.
We design a coherence-aware module that estimates the coherence of the predicted target and updates the temporal context in a more robust way.
We devise a spatio-temporal context aggregation module that aggregates the spatial and temporal context at each level of the decoder to alleviate the aliasing effect and learn a robust, discriminative feature representation of the target object.
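The coherence-gated update of the temporal context described in the second highlight can be sketched as follows. This is a minimal illustration, not the paper's learned CAM: the confidence-based `coherence_score`, the threshold `tau`, and the running-average fusion are all simplifying assumptions standing in for the learned module.

```python
import numpy as np

def coherence_score(mask_probs):
    """Mean per-pixel confidence of the predicted soft mask
    (a simple stand-in for the learned coherence estimate)."""
    return float(np.mean(np.maximum(mask_probs, 1.0 - mask_probs)))

def update_temporal_context(context, frame_feature, mask_probs, tau=0.8):
    """Fuse the current frame's feature into the temporal context only
    when the prediction is judged coherent; otherwise keep the old
    context unchanged to avoid injecting erroneous information."""
    if coherence_score(mask_probs) < tau:
        return context  # incoherent prediction: skip the update
    # illustrative running-average fusion of coherent features
    return 0.5 * context + 0.5 * frame_feature
```

The gating step captures the key idea: an unreliable prediction leaves the temporal context untouched rather than corrupting it.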

Abstract

Semi-supervised video object segmentation (VOS) is a highly challenging problem that has attracted much research attention in recent years. Temporal context plays an important role in VOS by providing object clues from past frames. However, most prevailing methods directly use the predicted temporal results to guide the segmentation of the current frame while ignoring the coherence of the temporal context, which can be misleading and degrade performance. In this paper, we propose a novel model named Coherence-aware Context Aggregator (CCA) for VOS, which consists of three modules. First, a coherence-aware module (CAM) evaluates the coherence of the current frame's predicted result and then fuses the coherent features to update the temporal context. CAM can determine whether the prediction is accurate, thus guiding the update of the temporal context and avoiding the introduction of erroneous information. Second, we devise a spatio-temporal context aggregation (STCA) module that aggregates the temporal context with the spatial feature of the current frame to learn a robust and discriminative target representation in the decoder. Third, we design a refinement module that refines the coarse feature generated by the STCA module for more precise segmentation. Additionally, CCA uses a cropping strategy and takes small-size images as input, making it computationally efficient and enabling a real-time running speed. Extensive experiments on four challenging benchmarks show that CCA achieves a better trade-off between efficiency and accuracy than state-of-the-art methods. The code will be made public.
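The per-level aggregation performed by the STCA module in the decoder can be sketched as below. The fixed blend weights `w_s` and `w_t` are illustrative assumptions replacing the learned fusion described in the abstract; the function names are hypothetical.

```python
import numpy as np

def aggregate_level(spatial_feat, temporal_ctx, w_s=0.6, w_t=0.4):
    """Blend the current frame's spatial feature with the temporal
    context at one decoder level (fixed weights stand in for the
    learned aggregation of the STCA module)."""
    assert spatial_feat.shape == temporal_ctx.shape
    return w_s * spatial_feat + w_t * temporal_ctx

def decode(spatial_pyramid, temporal_pyramid):
    """Apply the aggregation at every decoder level, coarse to fine,
    so each resolution sees both spatial and temporal cues."""
    return [aggregate_level(s, t)
            for s, t in zip(spatial_pyramid, temporal_pyramid)]
```

Aggregating at every level, rather than only at the bottleneck, is what lets the decoder counteract the aliasing effect mentioned in the highlights.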


Cited By

  • MVTr: multi-feature voxel transformer for 3D object detection, The Visual Computer 40(3):1453–1466, Mar 2024. doi:10.1007/s00371-023-02860-8
  • Depression Diagnosis and Analysis via Multimodal Multi-order Factor Fusion, Artificial Neural Networks and Machine Learning – ICANN 2024, pp. 56–70, Sep 2024. doi:10.1007/978-3-031-72353-7_5


          Published In

          Pattern Recognition, Volume 136, Issue C, April 2023, 858 pages

          Publisher

          Elsevier Science Inc., United States


          Author Tags

          1. Video object segmentation
          2. Semi-supervised learning
          3. Spatio-temporal representation
          4. Context

          Qualifiers

          • Research-article
