DOI: 10.1145/3581783.3612314
research-article

Unsupervised Domain Adaptation for Video Object Grounding with Cascaded Debiasing Learning

Published: 27 October 2023

Abstract

This paper addresses Unsupervised Domain Adaptation (UDA) for a dense frame prediction task, Video Object Grounding (VOG). The investigation is motivated by the limited generalization of data-driven approaches when they face unseen test scenarios. Our goal is to adapt a source-dominated model from a labeled source domain to an unlabeled target domain by re-training on pseudo-labels (i.e., predicted boxes of language-described objects). Because pseudo-label generation may carry source-domain biases, we decompose label refinement into two cascaded debiasing subroutines. (1) We develop a discarded training strategy to correct Biased Proposal Selection by filtering out examples whose proposals, selected from the proposal (candidate box) set, are uncertain. These uncertain examples are identified by the discordance between the predictions of the source-dominated model and those of a target-domain clustered classifier, which remains free from source-domain bias. (2) Building on the refined proposals, we measure the Grounding Coordinate Offset from the semantic distance of the model's predictions across domains, and use it to alleviate source-domain bias in the target model through adversarial learning. To evaluate the proposed method, we construct two UDA-VOG datasets, I2O-VOG and R2M-VOG, by manually dividing and combining well-known VOG datasets. Extensive experiments on them show that our model outperforms SOTA methods by a large margin.
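
The filtering idea in step (1) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how the target-domain clustered classifier is built or how discordance is measured, so the code below assumes a nearest-centroid selector over proposal features and treats any mismatch between the two selected proposal indices as discordance. All names (`select_by_source`, `select_by_centroid`, `filter_uncertain`, the dictionary keys, and the toy data) are hypothetical.

```python
from math import dist

def select_by_source(scores):
    """Index of the proposal that the source-dominated model scores highest."""
    return max(range(len(scores)), key=scores.__getitem__)

def select_by_centroid(features, centroid):
    """Index of the proposal nearest to a target-domain cluster centroid.

    Assumption: stands in for the target-domain clustered classifier.
    """
    return min(range(len(features)), key=lambda i: dist(features[i], centroid))

def filter_uncertain(examples, centroids):
    """Keep examples where both selectors agree; discard the others."""
    kept, discarded = [], []
    for ex in examples:
        src = select_by_source(ex["source_scores"])
        tgt = select_by_centroid(ex["proposal_features"],
                                 centroids[ex["query_class"]])
        if src == tgt:
            kept.append((ex["id"], src))   # agreed proposal becomes the pseudo-label
        else:
            discarded.append(ex["id"])     # discordant example is left out of re-training
    return kept, discarded

# Toy data: two clips, three candidate boxes each, 2-D proposal features.
examples = [
    {"id": "clip_a", "query_class": "dog",
     "source_scores": [0.1, 0.8, 0.1],
     "proposal_features": [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]},
    {"id": "clip_b", "query_class": "dog",
     "source_scores": [0.7, 0.2, 0.1],
     "proposal_features": [(4.0, 4.0), (1.1, 0.9), (6.0, 6.0)]},
]
centroids = {"dog": (1.0, 1.0)}  # assumed to come from clustering target-domain features

kept, discarded = filter_uncertain(examples, centroids)
print(kept)       # [('clip_a', 1)]
print(discarded)  # ['clip_b']
```

In this toy run, both selectors pick proposal 1 for `clip_a`, so it is kept. For `clip_b` the source scores favor proposal 0 while the centroid favors proposal 1, so the example is discarded.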

              Information & Contributors

              Information

              Published In

              MM '23: Proceedings of the 31st ACM International Conference on Multimedia
              October 2023
              9913 pages
              ISBN: 9798400701085
              DOI: 10.1145/3581783
              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Author Tags

              1. cascaded debiasing learning
              2. unsupervised domain adaptation
              3. video object grounding

              Qualifiers

              • Research-article

              Conference

              MM '23: The 31st ACM International Conference on Multimedia
              October 29 - November 3, 2023
              Ottawa ON, Canada

              Acceptance Rates

              Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

              Article Metrics

              • Downloads (last 12 months): 134
              • Downloads (last 6 weeks): 5
              Reflects downloads up to 12 Dec 2024.

              Cited By

              • (2024) Hierarchical Debiasing and Noisy Correction for Cross-domain Video Tube Retrieval. Proceedings of the 32nd ACM International Conference on Multimedia, 9271-9280. DOI: 10.1145/3664647.3681632. Online publication date: 28-Oct-2024.
              • (2024) Importance-aware Shared Parameter Subspace Learning for Domain Incremental Learning. Proceedings of the 32nd ACM International Conference on Multimedia, 8874-8883. DOI: 10.1145/3664647.3681411. Online publication date: 28-Oct-2024.
              • (2024) Revisiting the Domain Shift and Sample Uncertainty in Multi-source Active Domain Transfer. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16751-16761. DOI: 10.1109/CVPR52733.2024.01585. Online publication date: 16-Jun-2024.
              • (2024) Learning Frequency and Structure in UDA for Medical Object Detection. Pattern Recognition and Computer Vision, 518-532. DOI: 10.1007/978-981-97-8496-7_36. Online publication date: 3-Nov-2024.
              • (2023) A unified approach to domain incremental learning with memory. Proceedings of the 37th International Conference on Neural Information Processing Systems, 15027-15059. DOI: 10.5555/3666122.3666782. Online publication date: 10-Dec-2023.
