
Tracking-forced Referring Video Object Segmentation

Published: 28 October 2024

Abstract

Referring video object segmentation (RVOS) is a cross-modal task that aims to segment the target object described by a language expression. A video typically consists of multiple frames, and existing works conduct segmentation at either the clip level or the frame level. Clip-level methods process a clip at once and segment its frames in parallel, lacking explicit inter-frame interactions. In contrast, frame-level methods enable direct interactions between frames by processing videos frame by frame, but they are prone to error accumulation. In this paper, we propose a novel tracking-forced framework, which introduces high-quality tracking information and forces the model to achieve accurate segmentation. Concretely, we use the ground-truth segmentation of previous frames as accurate inter-frame interactions, providing high-quality tracking references for segmenting the next frame. This decouples the current input from the previous output, enabling our model to concentrate on segmenting accurately based solely on the given tracking information, which improves training efficiency and prevents error accumulation. For the inference stage, where ground-truth masks are unavailable, we carefully select the beginning frame used to construct the tracking information, aiming to ensure accurate tracking-based frame-by-frame object segmentation. With these designs, our tracking-forced method significantly outperforms existing methods on four widely used benchmarks by at least 3%. In particular, our method achieves 88.3% Precision@0.5 accuracy and an 87.6 overall IoU score on the JHMDB-Sentences dataset, surpassing the previous best methods by 5.0% and 8.0, respectively.
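The training/inference asymmetry described above can be sketched as teacher forcing over a per-frame segmenter: during training, frame t is conditioned on the ground-truth mask of frame t-1 (so frames can be supervised independently and errors never propagate), while at inference the model starts from a chosen beginning frame and feeds its own predictions forward. This is a minimal illustrative sketch, not the paper's implementation; `segment_frame` here is a hypothetical stand-in for the actual cross-modal segmentation network.

```python
def segment_frame(frame, reference_mask, expression):
    """Hypothetical stand-in for the segmentation model: given the current
    frame, a tracking reference mask, and the language expression, it would
    predict the target mask. Here it simply propagates the reference."""
    return reference_mask


def train_step(frames, gt_masks, expression):
    """Tracking-forced training: each frame t > 0 is conditioned on the
    GROUND-TRUTH mask of frame t-1, decoupling the current input from the
    previous output so no prediction error can accumulate."""
    preds = []
    for t in range(1, len(frames)):
        pred = segment_frame(frames[t], gt_masks[t - 1], expression)
        preds.append(pred)  # supervised against gt_masks[t]
    return preds


def sequential_inference(frames, expression, begin_index, begin_mask):
    """Inference without ground truth: bootstrap from a carefully selected
    beginning frame, then propagate each prediction as the tracking
    reference for the next frame."""
    masks = {begin_index: begin_mask}
    ref = begin_mask
    for t in range(begin_index + 1, len(frames)):
        ref = segment_frame(frames[t], ref, expression)
        masks[t] = ref
    return masks
```

The key design choice mirrored here is that `train_step` never consumes its own predictions, so all frames of a clip could in principle be trained in parallel, while `sequential_inference` is necessarily frame-by-frame.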




Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN:9798400706868
DOI:10.1145/3664647
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. parallel training
  2. referring video object segmentation
  3. sequential inference
  4. tracking-forced framework

Qualifiers

  • Research-article

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 paper acceptance rate: 1,150 of 4,385 submissions, 26%
Overall acceptance rate: 2,145 of 8,556 submissions, 25%

