
Coherence-aware context aggregator for fast video object segmentation

Published: 01 April 2023

Highlights

We re-analyze how temporal context is generated and utilized in the semi-supervised VOS task and introduce an efficient tracklet data flow, based on which we propose a novel VOS model that runs in real time. Experimental results on three benchmark datasets show that the proposed model achieves a better trade-off between efficiency and accuracy.
We design a coherence-aware module that estimates the coherence of the predicted target and updates the temporal context in a more robust way.
We devise a spatio-temporal context aggregation module that aggregates the spatial and temporal context at each level of the decoder to alleviate the aliasing effect and learn a robust, discriminative feature representation of the target object.
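The coherence-gated update of the temporal context described in the second highlight can be sketched as follows. This is a minimal illustration, not the paper's learned CAM: the confidence-based `coherence_score`, the threshold `tau`, and the running-average fusion are all simplifying assumptions standing in for the learned module.

```python
import numpy as np

def coherence_score(mask_probs):
    """Mean per-pixel confidence of the predicted soft mask
    (a simple stand-in for the learned coherence estimate)."""
    return float(np.mean(np.maximum(mask_probs, 1.0 - mask_probs)))

def update_temporal_context(context, frame_feature, mask_probs, tau=0.8):
    """Fuse the current frame's feature into the temporal context only
    when the prediction is judged coherent; otherwise keep the old
    context unchanged to avoid injecting erroneous information."""
    if coherence_score(mask_probs) < tau:
        return context  # incoherent prediction: skip the update
    # illustrative running-average fusion of coherent features
    return 0.5 * context + 0.5 * frame_feature
```

The gating step captures the key idea: an unreliable prediction leaves the temporal context untouched rather than corrupting it.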

Abstract

Semi-supervised video object segmentation (VOS) is a highly challenging problem that has attracted much research attention in recent years. Temporal context plays an important role in VOS by providing object clues from past frames. However, most prevailing methods directly use the predicted temporal results to guide the segmentation of the current frame while ignoring the coherence of the temporal context, which can be misleading and degrade performance. In this paper, we propose a novel model named Coherence-aware Context Aggregator (CCA) for VOS, which consists of three modules. First, a coherence-aware module (CAM) evaluates the coherence of the current frame's predicted result and then fuses the coherent features to update the temporal context. CAM can determine whether the prediction is accurate, thus guiding the update of the temporal context and avoiding the introduction of erroneous information. Second, we devise a spatio-temporal context aggregation (STCA) module that aggregates the temporal context with the spatial feature of the current frame to learn a robust and discriminative target representation in the decoder. Third, we design a refinement module that refines the coarse feature generated by the STCA module for more precise segmentation. Additionally, CCA uses a cropping strategy and takes small-size images as input, making it computationally efficient and enabling a real-time running speed. Extensive experiments on four challenging benchmarks show that CCA achieves a better trade-off between efficiency and accuracy than state-of-the-art methods. The code will be made public.
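The per-level aggregation performed by the STCA module in the decoder can be sketched as below. The fixed blend weights `w_s` and `w_t` are illustrative assumptions replacing the learned fusion described in the abstract; the function names are hypothetical.

```python
import numpy as np

def aggregate_level(spatial_feat, temporal_ctx, w_s=0.6, w_t=0.4):
    """Blend the current frame's spatial feature with the temporal
    context at one decoder level (fixed weights stand in for the
    learned aggregation of the STCA module)."""
    assert spatial_feat.shape == temporal_ctx.shape
    return w_s * spatial_feat + w_t * temporal_ctx

def decode(spatial_pyramid, temporal_pyramid):
    """Apply the aggregation at every decoder level, coarse to fine,
    so each resolution sees both spatial and temporal cues."""
    return [aggregate_level(s, t)
            for s, t in zip(spatial_pyramid, temporal_pyramid)]
```

Aggregating at every level, rather than only at the bottleneck, is what lets the decoder counteract the aliasing effect mentioned in the highlights.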


Cited By

  • MVTr: multi-feature voxel transformer for 3D object detection, The Visual Computer 40(3):1453–1466, Mar 2024. doi:10.1007/s00371-023-02860-8
  • Depression Diagnosis and Analysis via Multimodal Multi-order Factor Fusion, Artificial Neural Networks and Machine Learning – ICANN 2024, pp. 56–70, Sep 2024. doi:10.1007/978-3-031-72353-7_5


          Published In

          Pattern Recognition, Volume 136, Issue C, April 2023, 858 pages

          Publisher

          Elsevier Science Inc., United States


          Author Tags

          1. Video object segmentation
          2. Semi-supervised learning
          3. Spatio-temporal representation
          4. Context

          Qualifiers

          • Research-article
