ICMR '23 short paper. DOI: 10.1145/3591106.3592239

Video Retrieval for Everyday Scenes With Common Objects

Published: 12 June 2023

Abstract

We propose a video retrieval system for everyday scenes with common objects. Our system exploits the predictions made by deep neural networks for image understanding tasks, combined with natural language processing (NLP). It aims to capture the relationships between objects in a video scene as well as the ordering of the matching scenes. For each video in the database, it identifies and generates a sequence of key scene images. For each such scene, it generates the most probable captions using state-of-the-art image captioning models. The captions are parsed and represented as tree structures using NLP techniques, which are then stored and indexed in a database system. When a user poses a query video, a sequence of key scenes is generated. For each scene, a caption is generated using deep learning and parsed into its corresponding tree structure. Optimized tree-pattern queries are then constructed and executed on the database to retrieve a set of candidate videos. Finally, these candidate videos are ranked using a combination of the longest common subsequence of scene matches and the tree-edit distance between parse trees. We evaluated the performance of our system on the MSR-VTT dataset, which contains everyday scenes, and observed that it achieves higher mean average precision (mAP) than two recent techniques, CSQ and DnS.
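For concreteness, the sketch below illustrates in Python the kind of ranking computation the abstract describes: candidate videos scored by combining the longest common subsequence of matching scenes with the tree-edit distance between caption parse trees. All names, thresholds, and the scoring formula are illustrative assumptions; the paper's actual parser, BaseX/XPath index, and weighting scheme are not reproduced here.

```python
# Minimal illustrative sketch of the final ranking stage, under assumptions
# noted below; not the authors' implementation.
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Node:
    """One node of a caption parse tree (label plus ordered children)."""
    label: str
    children: tuple = ()


def _forest_size(forest):
    return sum(1 + _forest_size(n.children) for n in forest)


@lru_cache(maxsize=None)
def _ted(f1, f2):
    """Ordered-forest edit distance with unit costs (simple recursion;
    adequate for the small parse trees of single-sentence captions)."""
    if not f1 and not f2:
        return 0
    if not f1:
        return _forest_size(f2)
    if not f2:
        return _forest_size(f1)
    t1, t2 = f1[-1], f2[-1]
    delete = _ted(f1[:-1] + t1.children, f2) + 1   # delete rightmost root of f1
    insert = _ted(f1, f2[:-1] + t2.children) + 1   # insert rightmost root of f2
    relabel = 0 if t1.label == t2.label else 1
    match = _ted(f1[:-1], f2[:-1]) + _ted(t1.children, t2.children) + relabel
    return min(delete, insert, match)


def tree_edit_distance(a: Node, b: Node) -> int:
    return _ted((a,), (b,))


def lcs_length(query_scenes, candidate_scenes, match_threshold=2):
    """Longest common subsequence of scene matches, where two scenes 'match'
    when their parse trees are within an (assumed) edit-distance threshold."""
    n, m = len(query_scenes), len(candidate_scenes)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if tree_edit_distance(query_scenes[i - 1],
                                  candidate_scenes[j - 1]) <= match_threshold:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]


def rank_candidates(query_scenes, candidates, weight=0.1):
    """Score each candidate by LCS of scene matches, penalized by the total
    tree-edit distance between aligned scenes (the weighting is an assumption)."""
    scored = []
    for video_id, scenes in candidates.items():
        lcs = lcs_length(query_scenes, scenes)
        total_ted = sum(tree_edit_distance(q, s)
                        for q, s in zip(query_scenes, scenes))
        scored.append((video_id, lcs - weight * total_ted))
    return sorted(scored, key=lambda x: x[1], reverse=True)


if __name__ == "__main__":
    # Toy example: a query with one scene whose caption parses to S(NP(dog), VP(runs)).
    q = [Node("S", (Node("NP", (Node("dog"),)), Node("VP", (Node("runs"),))))]
    db = {"v1": [Node("S", (Node("NP", (Node("dog"),)), Node("VP", (Node("sits"),))))],
          "v2": [Node("S", (Node("NP", (Node("car"),)),))]}
    print(rank_candidates(q, db))  # v1 ranks above v2
```

In the full system described by the abstract, the parse trees would come from captions generated for each detected key scene, and the candidate set would first be retrieved via tree-pattern queries over the indexed trees; the sketch only covers the final re-ranking step.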

References

[1]
2007. BaseX | The XML Framework: Lightweight and High-Performance Data Processing. Retrieved July 1, 2022 from https://basex.org
[2]
2014. PySceneDetect. Retrieved July 1, 2022 from http://scenedetect.com/en/latest/
[3]
Aasif Ansari and Muzammil H Mohammed. 2015. Content Based Video Retrieval Systems-Methods, Techniques, Trends and Challenges. International Journal of Computer Applications 112, 7 (2015), 13–22.
[4]
Lorenzo Baraldi, Matthijs Douze, Rita Cucchiara, and Hervé Jégou. 2018. LAMV: Learning to Align and Match Videos With Kernelized Temporal Layers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 7804–7813.
[5]
Anders Berglund, Scott Boag, Don Chamberlin, Mary F. Fernandez, Michael Kay, Jonathan Robie, and Jerome Simeon. 2002. XML Path Language (XPath) 2.0 W3C Working Draft 16. Technical Report WD-xpath20-20020816. World Wide Web Consortium.
[6]
Mina Bishay, Georgios Zoumpourlis, and Ioannis Patras. 2019. TARN: Temporal Attentive Relation Network for Few-Shot and Zero-Shot Action Recognition. In British Machine Vision Conference. 1–14.
[7]
Yang Cai, Linjun Yang, Wei Ping, Fei Wang, Tao Mei, Xian-Sheng Hua, and Shipeng Li. 2011. Million-Scale Near-Duplicate Video Retrieval System. In Proceedings of the 19th ACM International Conference on Multimedia. Scottsdale, Arizona, USA, 837–838.
[8]
Liangliang Cao, Zhenguo Li, Yadong Mu, and Shih-Fu Chang. 2012. Submodular Video Hashing: A Unified Framework Towards Video Pooling and Indexing. In Proc. of the 20th ACM International Conference on Multimedia. Nara, Japan, 299–308.
[9]
Chien-Li Chou, Hua-Tsung Chen, and Suh-Yin Lee. 2015. Pattern-Based Near-Duplicate Video Retrieval and Localization on Web-Scale Videos. IEEE Transactions on Multimedia 17, 3 (2015), 382–395.
[10]
Dmitry Duplyakin, Robert Ricci, Aleksander Maricq, Gary Wong, Jonathon Duerig, Eric Eide, Leigh Stoller, Mike Hibler, David Johnson, Kirk Webb, Aditya Akella, Kuangching Wang, Glenn Ricart, Larry Landweber, Chip Elliott, Michael Zink, Emmanuel Cecchet, Snigdhaswin Kar, and Prabodh Mishra. 2019. The Design and Operation of CloudLab. In 2019 USENIX Annual Technical Conference. 1–14.
[11]
Yang Feng, Lin Ma, Wei Liu, Tong Zhang, and Jiebo Luo. 2018. Video Re-localization. In Proceedings of the European Conference on Computer Vision (ECCV). 1–16.
[12]
Zhanning Gao, Gang Hua, Dongqing Zhang, Nebojsa Jojic, Le Wang, Jianru Xue, and Nanning Zheng. 2017. ER3: A Unified Framework for Event Retrieval, Recognition and Recounting. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition. 2253–2262.
[13]
Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. 2016. Deep Image Retrieval: Learning Global Representations for Image Search. In Computer Vision - ECCV 2016. 241–257.
[14]
Christian Grün, Sebastian Gath, Alexander Holupirek, and Marc H. Scholl. 2009. XQuery Full Text Implementation in BaseX. In Proc. of the 6th International XML Database Symposium on Database and XML Technologies. 114–128.
[15]
Tengda Han, Weidi Xie, and Andrew Zisserman. 2019. Video Representation Learning by Dense Predictive Coding. In Proc. of the IEEE International Conference on Computer Vision Workshops. 1–10.
[16]
Tengda Han, Weidi Xie, and Andrew Zisserman. 2020. Memory-Augmented Dense Predictive Coding for Video Representation Learning. In Computer Vision - ECCV 2020. 312–329.
[17]
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531 (2015).
[18]
Yu-Gang Jiang, Yudong Jiang, and Jiajun Wang. 2014. VCDB: A Large-Scale Database for Partial Copy Detection in Videos. In Computer Vision – ECCV 2014, David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). Springer International Publishing, Cham, 357–371.
[19]
Yu-Gang Jiang and Jiajun Wang. 2016. Partial Copy Detection in Videos: A Benchmark and an Evaluation of Popular Methods. IEEE Transactions on Big Data 2, 1 (2016), 32–42.
[20]
Daniel Jurafsky and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (2nd ed.). Prentice Hall, USA.
[21]
Yannis Kalantidis, Clayton Mellina, and Simon Osindero. 2016. Cross-Dimensional Weighting for Aggregated Deep Convolutional Features. In Computer Vision - ECCV 2016 Workshops. 685–701.
[22]
Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Ioannis Patras, and Ioannis Kompatsiaris. 2019. ViSiL: Fine-Grained Spatio-Temporal Video Similarity Learning. In Proc. of the IEEE International Conference on Computer Vision. 6351–6360.
[23]
Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Ioannis Patras, and Yiannis Kompatsiaris. 2017. Near-Duplicate Video Retrieval with Deep Metric Learning. In Proc. of the IEEE International Conference on Computer Vision Workshops. 347–356.
[24]
Giorgos Kordopatis-Zilos, Christos Tzelepis, Symeon Papadopoulos, Ioannis Kompatsiaris, and Ioannis Patras. 2021. DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval. arXiv preprint arXiv:2106.13266 (2021).
[25]
Haofei Kuang, Yi Zhu, Zhi Zhang, Xinyu Li, Joseph Tighe, Soren Schwertfeger, Cyrill Stachniss, and Mu Li. 2021. Video Contrastive Learning with Global Context. In Proc. of the IEEE International Conference on Computer Vision. 3195–3204.
[26]
Siying Liang and Ping Wang. 2020. An Efficient Hierarchical Near-Duplicate Video Detection Algorithm Based on Deep Semantic Features. In MultiMedia Modeling: 26th International Conference, MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proceedings, Part I. Springer-Verlag, 752–763.
[27]
Venice Erin Liong, Jiwen Lu, Yap-Peng Tan, and Jie Zhou. 2017. Deep Video Hashing. IEEE Transactions on Multimedia 19, 6 (2017), 1209–1219.
[28]
Hao Liu, Qingjie Zhao, Hao Wang, Peng Lv, and Yanming Chen. 2017. An Image-Based Near-Duplicate Video Retrieval and Localization Using Improved Edit Distance. Multimedia Tools and Applications 76, 22 (2017), 24435–24456.
[29]
Ron Mokady, Amir Hertz, and Amit H Bermano. 2021. Clipcap: Clip Prefix for Image Captioning. arXiv preprint arXiv:2111.09734 (2021).
[30]
Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, and Bohyung Han. 2017. Large-Scale Image Retrieval with Attentive Deep Local Features. In Proc. of 2017 IEEE International Conference on Computer Vision. 1–10.
[31]
Sébastien Poullot, Shunsuke Tsukatani, Phuong Anh Nguyen, Hervé Jégou, and Shin’ichi Satoh. 2015. Temporal Matching Kernel with Explicit Feature Maps. In Proceedings of the 23rd ACM International Conference on Multimedia. 381–390.
[32]
Jérôme Revaud, Matthijs Douze, Cordelia Schmid, and Hervé Jégou. 2013. Event Retrieval in Large Video Collections with Circulant Temporal Encoding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition. 2459–2466.
[33]
Amaia Salvador, Xavier Giro-i Nieto, Ferran Marques, and Shin’ichi Satoh. 2016. Faster R-CNN Features for Instance Search. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 1–8.
[34]
J. Shao, X. Wen, B. Zhao, and X. Xue. 2021. Temporal Context Aggregation for Video Retrieval with Contrastive Learning. In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE Computer Society, Los Alamitos, CA, USA, 3267–3277. https://doi.org/10.1109/WACV48630.2021.00331
[35]
Jie Shao, Xin Wen, Bingchen Zhao, and Xiangyang Xue. 2021. Temporal Context Aggregation for Video Retrieval with Contrastive Learning. In Proc. of the IEEE Winter Conference on Applications of Computer Vision. 3268–3278.
[36]
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In Proc. of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2556–2565.
[37]
Jingkuan Song, Yi Yang, Zi Huang, Heng Tao Shen, and Richang Hong. 2011. Multiple Feature Hashing for Real-Time Large-Scale Near-Duplicate Video Retrieval. In Proceedings of the 19th ACM International Conference on Multimedia. Scottsdale, Arizona, 423–432.
[38]
Jingkuan Song, Yi Yang, Zi Huang, Heng Tao Shen, and Richang Hong. 2011. Multiple Feature Hashing for Real-Time Large Scale Near-Duplicate Video Retrieval. In Proc. of the 19th ACM International Conference on Multimedia. 423–432.
[39]
Hung-Khoon Tan, Chong-Wah Ngo, Richard Hong, and Tat-Seng Chua. 2009. Scalable Detection of Partial Near-Duplicate Videos by Visual-Temporal Consistency. In Proc. of the 17th ACM International Conference on Multimedia. 145–154.
[40]
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2017. Show and Tell: Lessons Learned from the 2015 MS COCO Image Captioning Challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 4 (2017), 652–663.
[41]
Kuan-Hsun Wang, Chia Chun Cheng, Yi-Ling Chen, Yale Song, and Shang-Hong Lai. 2020. Attention-Based Deep Metric Learning for Near-Duplicate Video Retrieval. In Proc. of International Conference on Pattern Recognition (ICPR). 5360–5367.
[42]
Xiao Wu, Alexander G Hauptmann, and Chong-Wah Ngo. 2007. Practical Elimination of Near-Duplicates from Web Video Search. In Proc. of the 15th ACM International Conference on Multimedia. 218–227.
[43]
Xiao Wu, Alexander G. Hauptmann, and Chong-Wah Ngo. 2007. Practical Elimination of Near-Duplicates from Web Video Search. In Proceedings of the 15th ACM International Conference on Multimedia. Augsburg, Germany, 218–227.
[44]
Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition. 5288–5296.
[45]
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proc. of the 32nd International Conference on Machine Learning. 2048–2057.
[46]
Yuanyuan Yang, Yonghong Tian, and Tiejun Huang. 2019. Multiscale Video Sequence Matching for Near-Duplicate Detection and Retrieval. Multimedia Tools and Applications 78, 1 (2019), 311–336.
[47]
Guangnan Ye, Dong Liu, Jun Wang, and Shih-Fu Chang. 2013. Large-Scale Video Hashing via Structure Learning. In Proc. of the IEEE International Conference on Computer Vision. 2272–2279.
[48]
Li Yuan, Tao Wang, Xiaopeng Zhang, Francis EH Tay, Zequn Jie, Wei Liu, and Jiashi Feng. 2020. Central Similarity Quantization for Efficient Image and Video Retrieval. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition. 3083–3092.
[49]
Arun Zachariah, Mohamed Gharibi, and Praveen Rao. 2020. QIK: A System for Large-Scale Image Retrieval on Everyday Scenes With Common Objects. In Proc. of the 2020 International Conference on Multimedia Retrieval. 126–135.
[50]
Arun Zachariah, Mohamed Gharibi, and Praveen Rao. 2021. A Large-Scale Image Retrieval System for Everyday Scenes. In Proc. of the 2nd ACM International Conference on Multimedia in Asia. Article 72, 3 pages.

Cited By

  • Self-Supervised Visual Preference Alignment. Proceedings of the 32nd ACM International Conference on Multimedia (2024), 291–300. https://doi.org/10.1145/3664647.3680993
  • DRM-SN: Detecting Reused Multimedia Content on Social Networks. 2024 IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR), 169–175. https://doi.org/10.1109/MIPR62202.2024.00033
  • Manga Scene Estimation by Quiz Question and Answer. Procedia Computer Science 246 (2024), 3878–3888. https://doi.org/10.1016/j.procs.2024.09.161

Published In

ICMR '23: Proceedings of the 2023 ACM International Conference on Multimedia Retrieval
June 2023
694 pages
ISBN:9798400701788
DOI:10.1145/3591106

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. NLP
  2. Video retrieval
  3. XML
  4. indexing
  5. ranking
  6. scene captioning

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

ICMR '23

Acceptance Rates

Overall Acceptance Rate 254 of 830 submissions, 31%
