Article

OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework

Authors:

Wenqiang Zhang

Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LVIII

Pages 20 - 40

https://doi.org/10.1007/978-3-031-73636-0_2

Published: 05 November 2024

Abstract

Contemporary Video Object Segmentation (VOS) approaches typically consist of stages of feature extraction, matching, memory management, and multi-object aggregation. Recent advanced models either model these components discretely in a sequential manner, or optimize a combined pipeline through substructure aggregation. However, these explicitly staged approaches prevent the VOS framework from being optimized as a unified whole, leading to limited capacity and suboptimal performance on complex videos. In this paper, we propose OneVOS, a novel framework that unifies the core components of VOS with an All-in-One Transformer. Specifically, to unify all the aforementioned modules into a vision transformer, we model all the features of frames, masks and memory for multiple objects as transformer tokens, and integrally accomplish feature extraction, matching and memory management of multiple objects through the flexible attention mechanism. Furthermore, a Unidirectional Hybrid Attention is proposed through a double decoupling of the original attention operation, to rectify semantic errors and ambiguities of stored tokens in the OneVOS framework. Finally, to alleviate the storage burden and expedite inference, we propose the Dynamic Token Selector, which unveils the working mechanism of OneVOS and naturally leads to a more efficient version of OneVOS. Extensive experiments demonstrate the superiority of OneVOS, achieving state-of-the-art performance across 7 datasets, particularly excelling on the complex LVOS and MOSE datasets with 70.1% and 66.4% J&F scores, surpassing previous state-of-the-art methods by 4.2% and 7.0%, respectively. Code is available at: https://github.com/L599wy/OneVOS.
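For readers unfamiliar with the metric quoted above: in the DAVIS-style VOS evaluation, J is the region similarity (intersection-over-union of the predicted and ground-truth masks), F is the contour accuracy (an F-measure over boundary pixels), and J&F is their mean. The following is a minimal sketch of that computation, not the evaluation code used by the authors. The function names are illustrative, masks are assumed to be nested lists of 0/1, and boundaries are matched exactly, whereas benchmark toolkits allow a small spatial tolerance when matching boundary pixels.

```python
from typing import Sequence, Set, Tuple

Mask = Sequence[Sequence[int]]  # binary mask given as rows of 0/1 (assumed format)


def region_similarity(pred: Mask, gt: Mask) -> float:
    """J: intersection-over-union of two binary masks of equal size."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else inter / union


def boundary(mask: Mask) -> Set[Tuple[int, int]]:
    """Foreground pixels with a background or out-of-image 4-neighbour."""
    h = len(mask)
    w = len(mask[0]) if h else 0
    points = set()
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                outside = not (0 <= rr < h and 0 <= cc < w)
                if outside or not mask[rr][cc]:
                    points.add((r, c))
                    break
    return points


def contour_accuracy(pred: Mask, gt: Mask) -> float:
    """F: F-measure between boundary pixel sets (simplified: exact matching)."""
    bp, bg = boundary(pred), boundary(gt)
    if not bp and not bg:
        return 1.0
    if not bp or not bg:
        return 0.0
    true_pos = len(bp & bg)
    precision = true_pos / len(bp)
    recall = true_pos / len(bg)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def j_and_f(pred: Mask, gt: Mask) -> float:
    """J&F: mean of region similarity and contour accuracy."""
    return 0.5 * (region_similarity(pred, gt) + contour_accuracy(pred, gt))
```

On a benchmark, these per-frame values are averaged over the frames and objects of each sequence and then over the dataset, so a reported score such as 70.1% is a dataset-level mean rather than a single-frame value.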

Index Terms

OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Image segmentation
        Video segmentation
      2. Computer vision tasks
        Activity recognition and understanding
        Video summarization
        Visual content-based indexing and retrieval

Index terms have been assigned to the content through auto-classification.

Information & Contributors

Information

Published In

Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LVIII

Sep 2024

568 pages

ISBN:978-3-031-73635-3

DOI:10.1007/978-3-031-73636-0

Editors:
Aleš Leonardis, University of Birmingham, Birmingham, UK
Elisa Ricci, University of Trento, Trento, Italy (https://ror.org/05trd4x28)
Stefan Roth, Technical University of Darmstadt, Darmstadt, Germany
Olga Russakovsky, Princeton University, Princeton, NJ, USA
Torsten Sattler, Czech Technical University in Prague, Prague, Czech Republic
Gül Varol, École des Ponts ParisTech, Marne-la-Vallée, France

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.

Publisher

Springer-Verlag

Berlin, Heidelberg
