DOI: 10.1145/3581783.3612161
Research Article
Open Access

AdaCLIP: Towards Pragmatic Multimodal Video Retrieval

Published: 27 October 2023

Abstract

Incorporating large image-text foundation models such as CLIP has substantially improved the performance of the multimodal video retrieval task. However, how to practically sample frames from a video and aggregate the frame features into a video representation remains an open research question. In particular, real-world deployment scenarios, such as embodiment within consumer electronics or cloud-based inference pipelines, require two key facets of retrieval (representation building and search) to be computationally light and fast. In this paper, we propose AdaCLIP, a computation- and latency-aware system for pragmatic multimodal video retrieval. AdaCLIP consists of a learning-based frame selection module that selects informative frames and a query-independent frame aggregation module that obtains strong video representations from the frame features. Specifically, in the frame selection module, we introduce a differentiable Hard-Top-k algorithm to sample a subset of the frames while optimizing the performance of the video retrieval task in an end-to-end manner. Moreover, to be latency-aware, we propose a lightweight, query-independent approach, MLP-Score, to aggregate the frame features into the video representation; it offers up to 142x speedup on GPU and 822x speedup on CPU in similarity-search time compared to query-dependent matching methods. Experimental results on several popular video retrieval datasets confirm the effectiveness of AdaCLIP.
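
To make the two modules concrete, below is a minimal PyTorch sketch of the ideas named in the abstract. It is an illustration under explicit assumptions, not the paper's implementation: the straight-through hard-top-k estimator, the per-frame linear scorer, and the two-layer scoring MLP (along with the names HardTopKSelector and MLPScore) are stand-ins, since the abstract does not specify the exact Hard-Top-k relaxation or MLP-Score architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class HardTopKSelector(nn.Module):
        """Select k of N frames with a hard top-k mask that stays trainable.

        The forward pass applies a discrete 0/1 mask over frames; the
        backward pass routes gradients through a softmax relaxation of the
        frame scores (a straight-through estimator).
        """

        def __init__(self, feat_dim, k, tau=1.0):
            super().__init__()
            self.k = k
            self.tau = tau
            self.scorer = nn.Linear(feat_dim, 1)  # per-frame saliency score

        def forward(self, frame_feats):
            # frame_feats: (batch, num_frames, feat_dim)
            scores = self.scorer(frame_feats).squeeze(-1)        # (B, N)
            soft = torch.softmax(scores / self.tau, dim=-1)      # relaxed weights
            idx = scores.topk(self.k, dim=-1).indices            # hard choice
            hard = torch.zeros_like(soft).scatter(-1, idx, 1.0)  # 0/1 mask
            mask = hard + soft - soft.detach()                   # straight-through
            return frame_feats * mask.unsqueeze(-1)              # unselected -> 0


    class MLPScore(nn.Module):
        """Query-independent aggregation: a small MLP scores each frame and
        the video embedding is the score-weighted sum of frame features."""

        def __init__(self, feat_dim, hidden=512):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(feat_dim, hidden),
                nn.GELU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, frame_feats):
            # frame_feats: (batch, num_frames, feat_dim)
            w = torch.softmax(self.mlp(frame_feats).squeeze(-1), dim=-1)  # (B, N)
            video = (w.unsqueeze(-1) * frame_feats).sum(dim=1)            # (B, D)
            return F.normalize(video, dim=-1)                             # unit norm


    # Usage with made-up shapes: 32 candidate CLIP frame features, keep 8.
    feats = torch.randn(2, 32, 512)
    video_emb = MLPScore(512)(HardTopKSelector(512, k=8)(feats))  # (2, 512)

In this sketch the forward pass makes a discrete frame selection while gradients flow through the softmax relaxation, which is what allows selection to be trained end-to-end against the retrieval objective. And because MLP-Score never conditions on the query, each video embedding can be computed once offline and compared against a text embedding with a single dot product; avoiding per-query frame matching is where the reported similarity-search speedups over query-dependent methods come from.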

Supplemental Material

MP4 File
Presentation video for the paper "AdaCLIP: Towards Pragmatic Multimodal Video Retrieval"

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 October 2023

Author Tags

  1. contrastive learning
  2. multimodal learning
  3. multimodal retrieval

Qualifiers

  • Research-article

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (last 12 months): 859
  • Downloads (last 6 weeks): 165
Reflects downloads up to 10 Dec 2024

Cited By

  • Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 5260-5269. https://doi.org/10.1145/3664647.3680731
  • Realizing Efficient On-Device Language-based Image Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 9 (2024), 1-18. https://doi.org/10.1145/3649896
