ICMR '23 short paper. DOI: 10.1145/3591106.3592239

Video Retrieval for Everyday Scenes With Common Objects

Published: 12 June 2023

Abstract

We propose a video retrieval system for everyday scenes with common objects. Our system exploits the predictions made by deep neural networks for image understanding tasks, combined with natural language processing (NLP). It aims to capture the relationships between objects in a video scene as well as the ordering of the matching scenes. For each video in the database, it identifies and generates a sequence of key scene images. For each such scene, it generates the most probable captions using state-of-the-art image captioning models. The captions are parsed and represented as tree structures using NLP techniques, which are then stored and indexed in a database system. When a user poses a query video, a sequence of key scenes is generated. For each scene, a caption is generated using deep learning and parsed into its corresponding tree structure. Optimized tree-pattern queries are then constructed and executed on the database to retrieve a set of candidate videos. Finally, these candidate videos are ranked using a combination of the longest common subsequence of scene matches and the tree-edit distance between parse trees. We evaluated the performance of our system on the MSR-VTT dataset, which contains everyday scenes, and observed that it achieves higher mean average precision (mAP) than two recent techniques, CSQ and DnS.
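For concreteness, the sketch below illustrates in Python the kind of ranking computation the abstract describes: candidate videos scored by combining the longest common subsequence of matching scenes with the tree-edit distance between caption parse trees. All names, thresholds, and the scoring formula are illustrative assumptions; the paper's actual parser, BaseX/XPath index, and weighting scheme are not reproduced here.

```python
# Minimal illustrative sketch of the final ranking stage, under assumptions
# noted below; not the authors' implementation.
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Node:
    """One node of a caption parse tree (label plus ordered children)."""
    label: str
    children: tuple = ()


def _forest_size(forest):
    return sum(1 + _forest_size(n.children) for n in forest)


@lru_cache(maxsize=None)
def _ted(f1, f2):
    """Ordered-forest edit distance with unit costs (simple recursion;
    adequate for the small parse trees of single-sentence captions)."""
    if not f1 and not f2:
        return 0
    if not f1:
        return _forest_size(f2)
    if not f2:
        return _forest_size(f1)
    t1, t2 = f1[-1], f2[-1]
    delete = _ted(f1[:-1] + t1.children, f2) + 1   # delete rightmost root of f1
    insert = _ted(f1, f2[:-1] + t2.children) + 1   # insert rightmost root of f2
    relabel = 0 if t1.label == t2.label else 1
    match = _ted(f1[:-1], f2[:-1]) + _ted(t1.children, t2.children) + relabel
    return min(delete, insert, match)


def tree_edit_distance(a: Node, b: Node) -> int:
    return _ted((a,), (b,))


def lcs_length(query_scenes, candidate_scenes, match_threshold=2):
    """Longest common subsequence of scene matches, where two scenes 'match'
    when their parse trees are within an (assumed) edit-distance threshold."""
    n, m = len(query_scenes), len(candidate_scenes)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if tree_edit_distance(query_scenes[i - 1],
                                  candidate_scenes[j - 1]) <= match_threshold:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]


def rank_candidates(query_scenes, candidates, weight=0.1):
    """Score each candidate by LCS of scene matches, penalized by the total
    tree-edit distance between aligned scenes (the weighting is an assumption)."""
    scored = []
    for video_id, scenes in candidates.items():
        lcs = lcs_length(query_scenes, scenes)
        total_ted = sum(tree_edit_distance(q, s)
                        for q, s in zip(query_scenes, scenes))
        scored.append((video_id, lcs - weight * total_ted))
    return sorted(scored, key=lambda x: x[1], reverse=True)


if __name__ == "__main__":
    # Toy example: a query with one scene whose caption parses to S(NP(dog), VP(runs)).
    q = [Node("S", (Node("NP", (Node("dog"),)), Node("VP", (Node("runs"),))))]
    db = {"v1": [Node("S", (Node("NP", (Node("dog"),)), Node("VP", (Node("sits"),))))],
          "v2": [Node("S", (Node("NP", (Node("car"),)),))]}
    print(rank_candidates(q, db))  # v1 ranks above v2
```

In the full system described by the abstract, the parse trees would come from captions generated for each detected key scene, and the candidate set would first be retrieved via tree-pattern queries over the indexed trees; the sketch only covers the final re-ranking step.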

References

[1]
2007. BaseX | The XML Framework: Lightweight and High-Performance Data Processing. Retrieved July 1, 2022 from https://basex.org
[2]
2014. PySceneDetect. Retrieved July 1, 2022 from http://scenedetect.com/en/latest/
[3]
Aasif Ansari and Muzammil H Mohammed. 2015. Content Based Video Retrieval Systems-Methods, Techniques, Trends and Challenges. International Journal of Computer Applications 112, 7 (2015), 13–22.
[4]
Lorenzo Baraldi, Matthijs Douze, Rita Cucchiara, and Hervé Jégou. 2018. LAMV: Learning to Align and Match Videos With Kernelized Temporal Layers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 7804–7813.
[5]
Anders Berglund, Scott Boag, Don Chamberlin, Mary F. Fernandez, Michael Kay, Jonathan Robie, and Jerome Simeon. 2002. XML Path Language (XPath) 2.0 W3C Working Draft 16. Technical Report WD-xpath20-20020816. World Wide Web Consortium.
[6]
Mina Bishay, Georgios Zoumpourlis, and Ioannis Patras. 2019. TARN: Temporal Attentive Relation Network for Few-Shot and Zero-Shot Action Recognition. In British Machine Vision Conference. 1–14.
[7]
Yang Cai, Linjun Yang, Wei Ping, Fei Wang, Tao Mei, Xian-Sheng Hua, and Shipeng Li. 2011. Million-Scale Near-Duplicate Video Retrieval System. In Proceedings of the 19th ACM International Conference on Multimedia. Scottsdale, Arizona, USA, 837–838.
[8]
Liangliang Cao, Zhenguo Li, Yadong Mu, and Shih-Fu Chang. 2012. Submodular Video Hashing: A Unified Framework Towards Video Pooling and Indexing. In Proc. of the 20th ACM International Conference on Multimedia. Nara, Japan, 299–308.
[9]
Chien-Li Chou, Hua-Tsung Chen, and Suh-Yin Lee. 2015. Pattern-Based Near-Duplicate Video Retrieval and Localization on Web-Scale Videos. IEEE Transactions on Multimedia 17, 3 (2015), 382–395.
[10]
Dmitry Duplyakin, Robert Ricci, Aleksander Maricq, Gary Wong, Jonathon Duerig, Eric Eide, Leigh Stoller, Mike Hibler, David Johnson, Kirk Webb, Aditya Akella, Kuangching Wang, Glenn Ricart, Larry Landweber, Chip Elliott, Michael Zink, Emmanuel Cecchet, Snigdhaswin Kar, and Prabodh Mishra. 2019. The Design and Operation of CloudLab. In 2019 USENIX Annual Technical Conference. 1–14.
[11]
Yang Feng, Lin Ma, Wei Liu, Tong Zhang, and Jiebo Luo. 2018. Video Re-localization. In Proceedings of the European Conference on Computer Vision (ECCV). 1–16.
[12]
Zhanning Gao, Gang Hua, Dongqing Zhang, Nebojsa Jojic, Le Wang, Jianru Xue, and Nanning Zheng. 2017. ER3: A Unified Framework for Event Retrieval, Recognition and Recounting. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition. 2253–2262.
[13]
Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. 2016. Deep Image Retrieval: Learning Global Representations for Image Search. In Computer Vision - ECCV 2016. 241–257.
[14]
Christian Grün, Sebastian Gath, Alexander Holupirek, and Marc H. Scholl. 2009. XQuery Full Text Implementation in BaseX. In Proc. of the 6th International XML Database Symposium on Database and XML Technologies. 114–128.
[15]
Tengda Han, Weidi Xie, and Andrew Zisserman. 2019. Video Representation Learning by Dense Predictive Coding. In Proc. of the IEEE International Conference on Computer Vision Workshops. 1–10.
[16]
Tengda Han, Weidi Xie, and Andrew Zisserman. 2020. Memory-Augmented Dense Predictive Coding for Video Representation Learning. In Computer Vision - ECCV 2020. 312–329.
[17]
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531 (2015).
[18]
Yu-Gang Jiang, Yudong Jiang, and Jiajun Wang. 2014. VCDB: A Large-Scale Database for Partial Copy Detection in Videos. In Computer Vision – ECCV 2014, David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). Springer International Publishing, Cham, 357–371.
[19]
Yu-Gang Jiang and Jiajun Wang. 2016. Partial Copy Detection in Videos: A Benchmark and an Evaluation of Popular Methods. IEEE Transactions on Big Data 2, 1 (2016), 32–42.
[20]
Daniel Jurafsky and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (2nd ed.). Prentice Hall, USA.
[21]
Yannis Kalantidis, Clayton Mellina, and Simon Osindero. 2016. Cross-Dimensional Weighting for Aggregated Deep Convolutional Features. In Computer Vision - ECCV 2016 Workshops. 685–701.
[22]
Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Ioannis Patras, and Ioannis Kompatsiaris. 2019. ViSiL: Fine-Grained Spatio-Temporal Video Similarity Learning. In Proc. of the IEEE International Conference on Computer Vision. 6351–6360.
[23]
Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Ioannis Patras, and Yiannis Kompatsiaris. 2017. Near-Duplicate Video Retrieval with Deep Metric Learning. In Proc. of the IEEE International Conference on Computer Vision Workshops. 347–356.
[24]
Giorgos Kordopatis-Zilos, Christos Tzelepis, Symeon Papadopoulos, Ioannis Kompatsiaris, and Ioannis Patras. 2021. DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval. arXiv preprint arXiv:2106.13266 (2021).
[25]
Haofei Kuang, Yi Zhu, Zhi Zhang, Xinyu Li, Joseph Tighe, Soren Schwertfeger, Cyrill Stachniss, and Mu Li. 2021. Video Contrastive Learning with Global Context. In Proc. of the IEEE International Conference on Computer Vision. 3195–3204.
[26]
Siying Liang and Ping Wang. 2020. An Efficient Hierarchical Near-Duplicate Video Detection Algorithm Based on Deep Semantic Features. In MultiMedia Modeling: 26th International Conference, MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proceedings, Part I. Springer-Verlag, 752–763.
[27]
Venice Erin Liong, Jiwen Lu, Yap-Peng Tan, and Jie Zhou. 2017. Deep Video Hashing. IEEE Transactions on Multimedia 19, 6 (2017), 1209–1219.
[28]
Hao Liu, Qingjie Zhao, Hao Wang, Peng Lv, and Yanming Chen. 2017. An Image-Based Near-Duplicate Video Retrieval and Localization Using Improved Edit Distance. Multimedia Tools and Applications 76, 22 (2017), 24435–24456.
[29]
Ron Mokady, Amir Hertz, and Amit H Bermano. 2021. Clipcap: Clip Prefix for Image Captioning. arXiv preprint arXiv:2111.09734 (2021).
[30]
Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, and Bohyung Han. 2017. Large-Scale Image Retrieval with Attentive Deep Local Features. In Proc. of 2017 IEEE International Conference on Computer Vision. 1–10.
[31]
Sébastien Poullot, Shunsuke Tsukatani, Phuong Anh Nguyen, Hervé Jégou, and Shin’ichi Satoh. 2015. Temporal Matching Kernel with Explicit Feature Maps. In Proceedings of the 23rd ACM International Conference on Multimedia. 381–390.
[32]
Jérôme Revaud, Matthijs Douze, Cordelia Schmid, and Hervé Jégou. 2013. Event Retrieval in Large Video Collections with Circulant Temporal Encoding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition. 2459–2466.
[33]
Amaia Salvador, Xavier Giro-i Nieto, Ferran Marques, and Shin’ichi Satoh. 2016. Faster R-CNN Features for Instance Search. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 1–8.
[34]
J. Shao, X. Wen, B. Zhao, and X. Xue. 2021. Temporal Context Aggregation for Video Retrieval with Contrastive Learning. In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE Computer Society, Los Alamitos, CA, USA, 3267–3277. https://doi.org/10.1109/WACV48630.2021.00331
[35]
Jie Shao, Xin Wen, Bingchen Zhao, and Xiangyang Xue. 2021. Temporal Context Aggregation for Video Retrieval with Contrastive Learning. In Proc. of the IEEE Winter Conference on Applications of Computer Vision. 3268–3278.
[36]
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In Proc. of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2556–2565.
[37]
Jingkuan Song, Yi Yang, Zi Huang, Heng Tao Shen, and Richang Hong. 2011. Multiple Feature Hashing for Real-Time Large-Scale Near-Duplicate Video Retrieval. In Proceedings of the 19th ACM International Conference on Multimedia. Scottsdale, Arizona, 423–432.
[38]
Jingkuan Song, Yi Yang, Zi Huang, Heng Tao Shen, and Richang Hong. 2011. Multiple Feature Hashing for Real-Time Large Scale Near-Duplicate Video Retrieval. In Proc. of the 19th ACM International Conference on Multimedia. 423–432.
[39]
Hung-Khoon Tan, Chong-Wah Ngo, Richard Hong, and Tat-Seng Chua. 2009. Scalable Detection of Partial Near-Duplicate Videos by Visual-Temporal Consistency. In Proc. of the 17th ACM International Conference on Multimedia. 145–154.
[40]
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2017. Show and Tell: Lessons Learned from the 2015 MS COCO Image Captioning Challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 4 (2017), 652–663.
[41]
Kuan-Hsun Wang, Chia Chun Cheng, Yi-Ling Chen, Yale Song, and Shang-Hong Lai. 2020. Attention-Based Deep Metric Learning for Near-Duplicate Video Retrieval. In Proc. of International Conference on Pattern Recognition (ICPR). 5360–5367.
[42]
Xiao Wu, Alexander G Hauptmann, and Chong-Wah Ngo. 2007. Practical Elimination of Near-Duplicates from Web Video Search. In Proc. of the 15th ACM International Conference on Multimedia. 218–227.
[43]
Xiao Wu, Alexander G. Hauptmann, and Chong-Wah Ngo. 2007. Practical Elimination of Near-Duplicates from Web Video Search. In Proceedings of the 15th ACM International Conference on Multimedia. Augsburg, Germany, 218–227.
[44]
Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition. 5288–5296.
[45]
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proc. of the 32nd International Conference on Machine Learning. 2048–2057.
[46]
Yuanyuan Yang, Yonghong Tian, and Tiejun Huang. 2019. Multiscale Video Sequence Matching for Near-Duplicate Detection and Retrieval. Multimedia Tools and Applications 78, 1 (2019), 311–336.
[47]
Guangnan Ye, Dong Liu, Jun Wang, and Shih-Fu Chang. 2013. Large-Scale Video Hashing via Structure Learning. In Proc. of the IEEE International Conference on Computer Vision. 2272–2279.
[48]
Li Yuan, Tao Wang, Xiaopeng Zhang, Francis EH Tay, Zequn Jie, Wei Liu, and Jiashi Feng. 2020. Central Similarity Quantization for Efficient Image and Video Retrieval. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition. 3083–3092.
[49]
Arun Zachariah, Mohamed Gharibi, and Praveen Rao. 2020. QIK: A System for Large-Scale Image Retrieval on Everyday Scenes With Common Objects. In Proc. of the 2020 International Conference on Multimedia Retrieval. 126–135.
[50]
Arun Zachariah, Mohamed Gharibi, and Praveen Rao. 2021. A Large-Scale Image Retrieval System for Everyday Scenes. In Proc. of the 2nd ACM International Conference on Multimedia in Asia. Article 72, 3 pages.

Cited By

  • Self-Supervised Visual Preference Alignment. Proceedings of the 32nd ACM International Conference on Multimedia (2024), 291–300. https://doi.org/10.1145/3664647.3680993
  • DRM-SN: Detecting Reused Multimedia Content on Social Networks. 2024 IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR), 169–175. https://doi.org/10.1109/MIPR62202.2024.00033
  • Manga Scene Estimation by Quiz Question and Answer. Procedia Computer Science 246 (2024), 3878–3888. https://doi.org/10.1016/j.procs.2024.09.161

Published In

ICMR '23: Proceedings of the 2023 ACM International Conference on Multimedia Retrieval
June 2023
694 pages
ISBN:9798400701788
DOI:10.1145/3591106

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. NLP
  2. Video retrieval
  3. XML
  4. indexing
  5. ranking
  6. scene captioning

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

ICMR '23

Acceptance Rates

Overall Acceptance Rate 254 of 830 submissions, 31%
