DOI: 10.1145/3664647.3680581

X-Prompt: Multi-modal Visual Prompt for Video Object Segmentation

Published: 28 October 2024

Abstract

Multi-modal Video Object Segmentation (VOS), including RGB-Thermal, RGB-Depth, and RGB-Event, has garnered attention due to its capability to address challenging scenarios where traditional VOS methods struggle, such as extreme illumination, rapid motion, and background distraction. Existing approaches often involve designing specific additional branches and performing full-parameter fine-tuning for fusion in each task. However, this paradigm not only duplicates research efforts and hardware costs but also risks model collapse given the limited multi-modal annotated data. In this paper, we propose a universal framework named X-Prompt for all multi-modal video object segmentation tasks, designated as RGB+X. The X-Prompt framework first pre-trains a video object segmentation foundation model using RGB data, and then utilizes the additional modality as the prompt to adapt it to downstream multi-modal tasks with limited data. Within the X-Prompt framework, we introduce the Multi-modal Visual Prompter (MVP), which allows prompting the foundation model with the various modalities to segment objects precisely. We further propose the Multi-modal Adaptation Experts (MAEs) to adapt the foundation model with pluggable modality-specific knowledge without compromising its generalization capacity. To evaluate the effectiveness of the X-Prompt framework, we conduct extensive experiments on 3 tasks across 4 benchmarks. The proposed universal X-Prompt framework consistently outperforms the full fine-tuning paradigm and achieves state-of-the-art performance. Code: https://github.com/PinxueGuo/X-Prompt.git
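
As a rough illustration of the adaptation idea in the abstract (a frozen RGB-pretrained model plus small, pluggable modality-specific modules), the sketch below attaches a generic low-rank residual adapter to a single linear layer. It is not the authors' MVP or MAE implementation. The names `FrozenLinear`, `ModalityExpert`, and `adapted_forward`, the rank of 2, and the zero-initialised up-projection are all assumptions made for illustration; the authors' actual code is in the repository linked above.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class FrozenLinear:
    """Stand-in for one layer of an RGB-pretrained model; its weights are never updated."""

    def __init__(self, in_dim, out_dim, rng):
        self.weight = [[rng.uniform(-1, 1) for _ in range(in_dim)]
                       for _ in range(out_dim)]

    def __call__(self, x):
        return matvec(self.weight, x)


class ModalityExpert:
    """Hypothetical pluggable expert: a low-rank residual update, one per modality."""

    def __init__(self, in_dim, out_dim, rank, rng):
        self.down = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                     for _ in range(rank)]
        # Assumption: zero-initialised up-projection, so a new expert starts as a no-op.
        self.up = [[0.0] * rank for _ in range(out_dim)]

    def __call__(self, x):
        return matvec(self.up, matvec(self.down, x))


def adapted_forward(base, expert, x):
    """Base output plus the expert's residual; with expert=None this is the plain base model."""
    base_out = base(x)
    if expert is None:
        return base_out
    return [b + d for b, d in zip(base_out, expert(x))]


rng = random.Random(0)
base = FrozenLinear(in_dim=4, out_dim=3, rng=rng)
# One expert per extra modality; only these small modules would be trained.
experts = {name: ModalityExpert(4, 3, rank=2, rng=rng)
           for name in ("thermal", "depth", "event")}

x = [0.5, -1.0, 2.0, 0.0]
rgb_only = adapted_forward(base, None, x)
with_thermal = adapted_forward(base, experts["thermal"], x)
print(rgb_only == with_thermal)
```

Because the up-projection starts at zero, plugging in an untrained expert leaves the base output unchanged, which is one common way to add modality-specific capacity without disturbing pretrained behaviour. Swapping the dictionary key selects a different modality while the base weights stay shared.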


    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN:9798400706868
    DOI:10.1145/3664647
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. multi-modal adaptation expert
    2. multi-modal video object segmentation
    3. multi-modal visual prompt
    4. rgb-x
    5. x-prompt

    Qualifiers

    • Research-article

    Conference

    MM '24
    MM '24: The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024
    Melbourne VIC, Australia

    Acceptance Rates

    MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;
    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
