DOI: 10.1145/3503161.3547968

Learn to Understand Negation in Video Retrieval

Published: 10 October 2022

Abstract

Negation is a common linguistic skill that allows humans to express what we do NOT want. Naturally, one might expect video retrieval to support natural-language queries with negation, e.g., finding shots of kids sitting on the floor and not playing with a dog. However, state-of-the-art deep-learning-based video retrieval models lack this ability, as they are typically trained on video description datasets such as MSR-VTT and VATEX that lack negated descriptions. Their retrieved results essentially ignore the negator in the example query, incorrectly returning videos showing kids playing with a dog. This paper presents the first study on learning to understand negation in video retrieval and makes the following contributions. First, by re-purposing two existing datasets (MSR-VTT and VATEX), we propose a new evaluation protocol for video retrieval with negation. Second, we propose a learning-based method for training a negation-aware video retrieval model. The key idea is to first construct a soft negative caption for a given training video by partially negating its original caption, and then compute a bidirectionally constrained loss on the resulting triplet. This auxiliary loss is added, with a weight, to a standard retrieval loss. Experiments on the re-purposed benchmarks show that re-training the CLIP (Contrastive Language-Image Pre-Training) model with the proposed method clearly improves its ability to handle queries with negation. Moreover, the model's performance on the original benchmarks is also improved.
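
The abstract describes the training objective only at a high level. The following PyTorch sketch illustrates one plausible reading: a standard retrieval loss combined, with a weight, with a bidirectionally constrained auxiliary loss over a (video, original caption, soft-negative caption) triplet. The negation rule in negate_caption, the margins m1 and m2, the choice of a symmetric InfoNCE retrieval loss, and the weight alpha are assumptions made for illustration, not details taken from the paper.

import torch
import torch.nn.functional as F

def negate_caption(caption: str) -> str:
    # Toy "partial negation": negate only the last clause of the caption, e.g.
    # "kids sitting on the floor and playing with a dog"
    #   -> "kids sitting on the floor and not playing with a dog".
    head, _, tail = caption.rpartition(" and ")
    return f"{head} and not {tail}" if head else f"not {caption}"

def retrieval_loss(sim: torch.Tensor) -> torch.Tensor:
    # Standard symmetric InfoNCE over a batch similarity matrix (videos x captions).
    labels = torch.arange(sim.size(0), device=sim.device)
    return 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))

def negation_aware_loss(sim_pos, sim_neg, sim_irr, m1=0.2, m2=0.2):
    # Bidirectionally constrained loss on the (video, caption, negated caption) triplet:
    # the soft negative should score below the original caption by margin m1, yet stay
    # above a fully irrelevant caption by margin m2 (both constraints are assumptions).
    # sim_pos: sim(video, original caption), sim_neg: sim(video, soft-negative caption),
    # sim_irr: sim(video, unrelated caption); all of shape (batch,).
    upper = F.relu(sim_neg - sim_pos + m1)  # push the negated caption below the original
    lower = F.relu(sim_irr - sim_neg + m2)  # keep it above unrelated captions
    return (upper + lower).mean()

# Total objective: the standard retrieval loss plus the weighted auxiliary term,
# with alpha (e.g. 0.5) as a hypothetical weight:
#   loss = retrieval_loss(sim_matrix) + alpha * negation_aware_loss(sim_pos, sim_neg, sim_irr)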

Supplementary Material

MP4 File (MM22-fp0907.mp4)
Presentation video of Learn to Understand Negation in Video Retrieval

References

[1]
George Awad, Asad A. Butt, Keith Curtis, Jonathan Fiscus, Afzal Godil, Yooyoung Lee, Andrew Delgado, Jesse Zhang, Eliot Godard, Baptiste Chocot, Lukas Diduch, Jeffrey Liu, Yvette Graham, Gareth J. F. Jones, and Georges Quénot. 2021. Evaluating Multiple Video Understanding and Retrieval Tasks at TRECVID 2021. In TRECVID Workshop.
[2]
Aozhu Chen, Fan Hu, Zihan Wang, Fangming Zhou, and Xirong Li. 2021. What Matters for Ad-hoc Video Search? A Large-scale Evaluation on TRECVID. In ICCV Workshop on ViRal.
[3]
David Chen and William Dolan. 2011. Collecting Highly Parallel Data for Paraphrase Evaluation. In CVPR.
[4]
Shizhe Chen, Yida Zhao, Qin Jin, and Qi Wu. 2020. Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning. In CVPR.
[5]
Matthew Cooper, John Adcock, Robert Chen, and Hanning Zhou. 2005. FXPAL at TRECVID 2005. In TRECVID Workshop.
[6]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.
[7]
Jianfeng Dong, Xirong Li, Chaoxi Xu, Xun Yang, Gang Yang, and Xun Wang. 2021. Dual Encoding for Video Retrieval by Text. TPAMI 44, 8 (2021), 4065--4080.
[8]
Maksim Dzabraev, Maksim Kalashnikov, Stepan Komkov, and Aleksandr Petiushko. 2021. MDMMT: Multidomain Multimodal Transformer for Video Retrieval. In CVPR Workshop on HVU.
[9]
Allyson Ettinger. 2020. What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models. TACL 8 (2020), 34--48.
[10]
Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2018. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. In BMVC.
[11]
Han Fang, Pengfei Xiong, Luhui Xu, and Yu Chen. 2021. CLIP2Video: Mastering Video-Text Retrieval via Image CLIP. arXiv preprint arXiv:2106.11097 (2021).
[12]
Valentin Gabeur, Chen Sun, Karteek Alahari, and Cordelia Schmid. 2020. Multi-Modal Transformer for Video Retrieval. In ECCV.
[13]
Ning Han, Jingjing Chen, Guangyi Xiao, Hao Zhang, Yawen Zeng, and Hao Chen. 2021. Fine-grained Cross-modal Alignment Network for Text-Video Retrieval. In ACMMM.
[14]
Arian Hosseini, Siva Reddy, Dzmitry Bahdanau, R Devon Hjelm, Alessandro Sordoni, and Aaron Courville. 2021. Understanding by Understanding Not: Modeling Negation in Language Models. In NAACL.
[15]
Fan Hu, Aozhu Chen, Ziyue Wang, Fangming Zhou, Jianfeng Dong, and Xirong Li. 2022. Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval. In ECCV.
[16]
Andrej Karpathy and Li Fei-Fei. 2017. Deep Visual-Semantic Alignments for Generating Image Descriptions. TPAMI 39, 4 (2017), 664--676.
[17]
Nora Kassner and Hinrich Schütze. 2020. Negated and Misprimed Probes for Pretrained Language Models: Birds Can Talk, But Cannot Fly. In ACL.
[18]
Aditya Khandelwal and Suraj T. Sawant. 2020. NegBERT: A Transfer Learning Approach for Negation Detection and Scope Resolution. In LREC.
[19]
Xirong Li, Chaoxi Xu, Gang Yang, Zhineng Chen, and Jianfeng Dong. 2019. W2VV++: Fully Deep Learning for Ad-hoc Video Search. In ACMMM.
[20]
Xirong Li, Fangming Zhou, Chaoxi Xu, Jiaqi Ji, and Gang Yang. 2021. SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries. TMM 23 (2021), 4351--4362.
[21]
Xirong Li, Yang Zhou, Jie Wang, Hailan Lin, Jianchun Zhao, Dayong Ding, Weihong Yu, and Youxin Chen. 2021. Multi-Modal Multi-Instance Learning for Retinal Disease Recognition. In ACMMM.
[22]
Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In ECCV.
[23]
Yang Liu, Samuel Albanie, Arsha Nagrani, and Andrew Zisserman. 2019. Use What You Have: Video Retrieval Using Representations from Collaborative Experts. In BMVC.
[24]
Jakub Lokoč, Tomáš Souček, Patrik Veselý, František Mejzlík, Jiaqi Ji, Chaoxi Xu, and Xirong Li. 2020. A W2VV++ Case Study with Automated and Interactive Text-to-Video Retrieval. In ACMMM.
[25]
Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. 2021. CLIP4Clip: An Empirical Study of CLIP for End-to-End Video Clip Retrieval. arXiv preprint arXiv:2104.08860 (2021).
[26]
Niluthpol Chowdhury Mithun, Juncheng Li, Florian Metze, and Amit K. Roy-Chowdhury. 2018. Learning Joint Embedding with Multimodal Cues for Cross-modal Video-Text Retrieval. In ICMR.
[27]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NeurIPS.
[28]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models from Natural Language Supervision. In ICML.
[29]
Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. 2019. VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. In ICCV.
[30]
Jiaxin Wu and Chong-Wah Ngo. 2020. Interpretable Embedding for Ad-Hoc Video Search. In ACMMM.
[31]
Peng Wu, Xiangteng He, Mingqian Tang, Yiliang Lv, and Jing Liu. 2021. HANet: Hierarchical Alignment Networks for Video-Text Retrieval. In ACMMM.
[32]
Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A Large Video Description Dataset for Bridging Video And Language. In CVPR.
[33]
Xun Yang, Jianfeng Dong, Yixin Cao, Xun Wang, Meng Wang, and Tat-Seng Chua. 2020. Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval. In SIGIR.
[34]
Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL 2 (2014), 67--78.
[35]
Youngjae Yu, Jongseok Kim, and Gunhee Kim. 2018. A Joint Sequence Fusion Model for Video Question Answering and Retrieval. In ECCV.
[36]
Pengpeng Zeng, Lianli Gao, Xinyu Lyu, Shuaiqi Jing, and Jingkuan Song. 2021. Conceptual and Syntactical Cross-modal Alignment with Cross-level Consistency for Image-Text Matching. In ACMMM.




Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN:9781450392037
DOI:10.1145/3503161
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. negation learning
  2. nt2vr
  3. text-to-video retrieval (t2vr)

Qualifiers

  • Research-article

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Cited By

  • (2024)Cliprerank: An Extremely Simple Method For Improving Ad-Hoc Video SearchICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)10.1109/ICASSP48485.2024.10446902(7850-7854)Online publication date: 14-Apr-2024
  • (2024)Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)10.1109/CVPR52733.2024.01566(16551-16560)Online publication date: 16-Jun-2024
  • (2024)MQuA: Multi-level Query-Video Augmentation for Multilingual Video Corpus RetrievalNatural Language Processing and Chinese Computing10.1007/978-981-97-9443-0_31(353-364)Online publication date: 1-Nov-2024
  • (2024)Beyond Coarse-Grained Matching in Video-Text RetrievalComputer Vision – ACCV 202410.1007/978-981-96-0908-6_2(25-43)Online publication date: 7-Dec-2024
  • (2023)Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive LearningProceedings of the 31st ACM International Conference on Multimedia10.1145/3581783.3612006(4626-4636)Online publication date: 26-Oct-2023
