OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework
Pages 20 - 40
Abstract
Contemporary Video Object Segmentation (VOS) approaches typically consist of separate stages for feature extraction, matching, memory management, and multi-object aggregation. Recent advanced models either handle these components as discrete modules applied in sequence, or optimize a combined pipeline through substructure aggregation. However, these explicitly staged approaches prevent the VOS framework from being optimized as a unified whole, which limits capacity and leads to suboptimal performance on complex videos. In this paper, we propose OneVOS, a novel framework that unifies the core components of VOS with an All-in-One Transformer. Specifically, to unify all of the aforementioned modules into a single vision transformer, we model the features of frames, masks, and memory for multiple objects as transformer tokens, and jointly accomplish feature extraction, matching, and memory management for multiple objects through the flexible attention mechanism. Furthermore, we propose a Unidirectional Hybrid Attention, obtained through a double decoupling of the original attention operation, to rectify semantic errors and ambiguities in the tokens stored by the OneVOS framework. Finally, to alleviate the storage burden and speed up inference, we propose the Dynamic Token Selector, which reveals the working mechanism of OneVOS and naturally leads to a more efficient version of OneVOS. Extensive experiments demonstrate the superiority of OneVOS, which achieves state-of-the-art performance across 7 datasets. It particularly excels on the complex LVOS and MOSE datasets, scoring 70.1% and 66.4% and surpassing previous state-of-the-art methods by 4.2% and 7.0%, respectively. Code is available at: https://github.com/L599wy/OneVOS.
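To make the token-based formulation concrete, the following is a minimal sketch of a single attention pass over concatenated memory and current-frame tokens, written in plain Python. It is an illustration under stated assumptions, not the OneVOS implementation: the abstract does not specify how the Unidirectional Hybrid Attention or the Dynamic Token Selector are built, so the one-way mask (memory tokens cannot attend to current-frame tokens), the absence of learned projections, and the attention-based ranking in the hypothetical `select_memory_tokens` helper are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax; entries equal to -inf get weight 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(memory_tokens, frame_tokens, one_way=True):
    """Single-head self-attention over the sequence [memory; frame].

    Assumption: queries, keys and values are the raw tokens (no learned
    projections). With one_way=True, memory tokens cannot attend to
    current-frame tokens, while frame tokens attend to every token.
    Returns (updated_tokens, attention_matrix).
    """
    tokens = memory_tokens + frame_tokens
    n_mem = len(memory_tokens)
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    updated, attention = [], []
    for i, query in enumerate(tokens):
        scores = []
        for j, key in enumerate(tokens):
            # Block the memory -> frame direction only.
            blocked = one_way and i < n_mem and j >= n_mem
            dot = sum(q * k for q, k in zip(query, key))
            scores.append(-math.inf if blocked else dot * scale)
        weights = softmax(scores)
        attention.append(weights)
        updated.append([
            sum(w * value[d] for w, value in zip(weights, tokens))
            for d in range(dim)
        ])
    return updated, attention


def select_memory_tokens(attention, n_mem, keep):
    """Hypothetical selector: keep the `keep` memory tokens that receive
    the most attention from current-frame queries (an assumed criterion)."""
    received = [sum(row[j] for row in attention[n_mem:]) for j in range(n_mem)]
    ranked = sorted(range(n_mem), key=lambda j: received[j], reverse=True)
    return sorted(ranked[:keep])


# Toy example: three stored memory tokens and two current-frame tokens.
memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
frame = [[1.0, 0.0], [0.9, 0.1]]
updated, attention = joint_attention(memory, frame)
kept = select_memory_tokens(attention, n_mem=len(memory), keep=2)
print(kept)
```

In this toy setup, matching happens inside the same attention call that updates the frame tokens, and pruning the memory amounts to dropping the least-attended stored tokens before the next frame. That mirrors the abstract's description only at a high level.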
Information
Published In
Sep 2024
568 pages
ISBN:978-3-031-73635-3
DOI:10.1007/978-3-031-73636-0
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
Publisher
Springer-Verlag
Berlin, Heidelberg
Publication History
Published: 05 November 2024
Qualifiers
- Article