DOI: 10.1145/3503161.3547968

Learn to Understand Negation in Video Retrieval

Published: 10 October 2022

Abstract

Negation is a common linguistic skill that allows humans to express what we do NOT want. Naturally, one might expect video retrieval to support natural-language queries with negation, e.g., finding shots of kids sitting on the floor and not playing with a dog. However, state-of-the-art deep-learning-based video retrieval models lack this ability, as they are typically trained on video description datasets such as MSR-VTT and VATEX that lack negated descriptions. Their retrieved results essentially ignore the negator in the example query, incorrectly returning videos showing kids playing with a dog. This paper presents the first study on learning to understand negation in video retrieval and makes the following contributions. First, by re-purposing two existing datasets (MSR-VTT and VATEX), we propose a new evaluation protocol for video retrieval with negation. Second, we propose a learning-based method for training a negation-aware video retrieval model. The key idea is to first construct a soft negative caption for a given training video by partially negating its original caption, and then compute a bidirectionally constrained loss on the resulting triplet. This auxiliary loss is added, with a weight, to a standard retrieval loss. Experiments on the re-purposed benchmarks show that re-training the CLIP (Contrastive Language-Image Pre-Training) model with the proposed method clearly improves its ability to handle queries with negation. Moreover, the model's performance on the original benchmarks is also improved.
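
The abstract describes the training objective only at a high level. The following PyTorch sketch illustrates one plausible reading: a standard retrieval loss combined, with a weight, with a bidirectionally constrained auxiliary loss over a (video, original caption, soft-negative caption) triplet. The negation rule in negate_caption, the margins m1 and m2, the choice of a symmetric InfoNCE retrieval loss, and the weight alpha are assumptions made for illustration, not details taken from the paper.

import torch
import torch.nn.functional as F

def negate_caption(caption: str) -> str:
    # Toy "partial negation": negate only the last clause of the caption, e.g.
    # "kids sitting on the floor and playing with a dog"
    #   -> "kids sitting on the floor and not playing with a dog".
    head, _, tail = caption.rpartition(" and ")
    return f"{head} and not {tail}" if head else f"not {caption}"

def retrieval_loss(sim: torch.Tensor) -> torch.Tensor:
    # Standard symmetric InfoNCE over a batch similarity matrix (videos x captions).
    labels = torch.arange(sim.size(0), device=sim.device)
    return 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))

def negation_aware_loss(sim_pos, sim_neg, sim_irr, m1=0.2, m2=0.2):
    # Bidirectionally constrained loss on the (video, caption, negated caption) triplet:
    # the soft negative should score below the original caption by margin m1, yet stay
    # above a fully irrelevant caption by margin m2 (both constraints are assumptions).
    # sim_pos: sim(video, original caption), sim_neg: sim(video, soft-negative caption),
    # sim_irr: sim(video, unrelated caption); all of shape (batch,).
    upper = F.relu(sim_neg - sim_pos + m1)  # push the negated caption below the original
    lower = F.relu(sim_irr - sim_neg + m2)  # keep it above unrelated captions
    return (upper + lower).mean()

# Total objective: the standard retrieval loss plus the weighted auxiliary term,
# with alpha (e.g. 0.5) as a hypothetical weight:
#   loss = retrieval_loss(sim_matrix) + alpha * negation_aware_loss(sim_pos, sim_neg, sim_irr)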

Supplementary Material

MP4 File (MM22-fp0907.mp4)
Presentation video of Learn to Understand Negation in Video Retrieval

References

[1]
George Awad, Asad A. Butt, Keith Curtis, Jonathan Fiscus, Afzal Godil, Yooyoung Lee, Andrew Delgado, Jesse Zhang, Eliot Godard, Baptiste Chocot, Lukas Diduch, Jeffrey Liu, Yvette Graham, Gareth J. F. Jones, and Georges Quénot. 2021. Evaluating Multiple Video Understanding and Retrieval Tasks at TRECVID 2021. In TRECVID Workshop.
[2]
Aozhu Chen, Fan Hu, Zihan Wang, Fangming Zhou, and Xirong Li. 2021. What Matters for Ad-hoc Video Search? A Large-scale Evaluation on TRECVID. In ICCV Workshop on ViRal.
[3]
David Chen and William Dolan. 2011. Collecting Highly Parallel Data for Paraphrase Evaluation. In CVPR.
[4]
Shizhe Chen, Yida Zhao, Qin Jin, and Qi Wu. 2020. Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning. In CVPR.
[5]
Matthew Cooper, John Adcock, Robert Chen, and Hanning Zhou. 2005. FXPAL at TRECVID 2005. In TRECVID Workshop.
[6]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.
[7]
Jianfeng Dong, Xirong Li, Chaoxi Xu, Xun Yang, Gang Yang, and Xun Wang. 2021. Dual Encoding for Video Retrieval by Text. TPAMI 44, 8 (2021), 4065--4080.
[8]
Maksim Dzabraev, Maksim Kalashnikov, Stepan Komkov, and Aleksandr Petiushko. 2021. MDMMT: Multidomain Multimodal Transformer for Video Retrieval. In CVPR Workshop on HVU.
[9]
Allyson Ettinger. 2020. What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models. TACL 8 (2020), 34--48.
[10]
Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2018. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. In BMVC.
[11]
Han Fang, Pengfei Xiong, Luhui Xu, and Yu Chen. 2021. CLIP2Video: Mastering Video-Text Retrieval via Image CLIP. arXiv preprint arXiv:2106.11097 (2021).
[12]
Valentin Gabeur, Chen Sun, Karteek Alahari, and Cordelia Schmid. 2020. Multi-Modal Transformer for Video Retrieval. In ECCV.
[13]
Ning Han, Jingjing Chen, Guangyi Xiao, Hao Zhang, Yawen Zeng, and Hao Chen. 2021. Fine-grained Cross-modal Alignment Network for Text-Video Retrieval. In ACMMM.
[14]
Arian Hosseini, Siva Reddy, Dzmitry Bahdanau, R Devon Hjelm, Alessandro Sordoni, and Aaron Courville. 2021. Understanding by Understanding Not: Modeling Negation in Language Models. In NAACL.
[15]
Fan Hu, Aozhu Chen, Ziyue Wang, Fangming Zhou, Jianfeng Dong, and Xirong Li. 2022. Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval. In ECCV.
[16]
Andrej Karpathy and Li Fei-Fei. 2017. Deep Visual-Semantic Alignments for Generating Image Descriptions. TPAMI 39, 4 (2017), 664--676.
[17]
Nora Kassner and Hinrich Schütze. 2020. Negated and Misprimed Probes for Pretrained Language Models: Birds Can Talk, But Cannot Fly. In ACL.
[18]
Aditya Khandelwal and Suraj T. Sawant. 2020. NegBERT: A Transfer Learning Approach for Negation Detection and Scope Resolution. In LREC.
[19]
Xirong Li, Chaoxi Xu, Gang Yang, Zhineng Chen, and Jianfeng Dong. 2019. W2VV++: Fully Deep Learning for Ad-hoc Video Search. In ACMMM.
[20]
Xirong Li, Fangming Zhou, Chaoxi Xu, Jiaqi Ji, and Gang Yang. 2021. SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries. TMM 23 (2021), 4351--4362.
[21]
Xirong Li, Yang Zhou, Jie Wang, Hailan Lin, Jianchun Zhao, Dayong Ding, Weihong Yu, and Youxin Chen. 2021. Multi-Modal Multi-Instance Learning for Retinal Disease Recognition. In ACMMM.
[22]
Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In ECCV.
[23]
Yang Liu, Samuel Albanie, Arsha Nagrani, and Andrew Zisserman. 2019. Use What You Have: Video Retrieval Using Representations from Collaborative Experts. In BMVC.
[24]
Jakub Lokoč, Tomáš Souček, Patrik Veselý, František Mejzlík, Jiaqi Ji, Chaoxi Xu, and Xirong Li. 2020. A W2VV++ Case Study with Automated and Interactive Text-to-Video Retrieval. In ACMMM.
[25]
Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. 2021. CLIP4Clip: An Empirical Study of CLIP for End-to-End Video Clip Retrieval. arXiv preprint arXiv:2104.08860 (2021).
[26]
Niluthpol Chowdhury Mithun, Juncheng Li, Florian Metze, and Amit K. Roy-Chowdhury. 2018. Learning Joint Embedding with Multimodal Cues for Cross-modal Video-Text Retrieval. In ICMR.
[27]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NeurIPS.
[28]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models from Natural Language Supervision. In ICML.
[29]
Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. 2019. VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. In ICCV.
[30]
Jiaxin Wu and Chong-Wah Ngo. 2020. Interpretable Embedding for Ad-Hoc Video Search. In ACMMM.
[31]
Peng Wu, Xiangteng He, Mingqian Tang, Yiliang Lv, and Jing Liu. 2021. HANet: Hierarchical Alignment Networks for Video-Text Retrieval. In ACMMM.
[32]
Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A Large Video Description Dataset for Bridging Video And Language. In CVPR.
[33]
Xun Yang, Jianfeng Dong, Yixin Cao, Xun Wang, Meng Wang, and Tat-Seng Chua. 2020. Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval. In SIGIR.
[34]
Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL 2 (2014), 67--78.
[35]
Youngjae Yu, Jongseok Kim, and Gunhee Kim. 2018. A Joint Sequence Fusion Model for Video Question Answering and Retrieval. In ECCV.
[36]
Pengpeng Zeng, Lianli Gao, Xinyu Lyu, Shuaiqi Jing, and Jingkuan Song. 2021. Conceptual and Syntactical Cross-modal Alignment with Cross-level Consistency for Image-Text Matching. In ACMMM.




Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN:9781450392037
DOI:10.1145/3503161
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. negation learning
  2. nt2vr
  3. text-to-video retrieval (t2vr)

Qualifiers

  • Research-article

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Cited By

  • (2024)Cliprerank: An Extremely Simple Method For Improving Ad-Hoc Video SearchICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)10.1109/ICASSP48485.2024.10446902(7850-7854)Online publication date: 14-Apr-2024
  • (2024)Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)10.1109/CVPR52733.2024.01566(16551-16560)Online publication date: 16-Jun-2024
  • (2024)MQuA: Multi-level Query-Video Augmentation for Multilingual Video Corpus RetrievalNatural Language Processing and Chinese Computing10.1007/978-981-97-9443-0_31(353-364)Online publication date: 1-Nov-2024
  • (2024)Beyond Coarse-Grained Matching in Video-Text RetrievalComputer Vision – ACCV 202410.1007/978-981-96-0908-6_2(25-43)Online publication date: 7-Dec-2024
  • (2023)Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive LearningProceedings of the 31st ACM International Conference on Multimedia10.1145/3581783.3612006(4626-4636)Online publication date: 26-Oct-2023
