
Person-action Instance Search in Story Videos: An Experimental Study

Published: 07 November 2023

Abstract

Person-Action instance search (P-A INS) aims to retrieve instances of a specific person performing a specific action, a task featured in the 2019–2021 INS tasks of the well-known TREC Video Retrieval Evaluation (TRECVID). Most top-ranking solutions can be summarized by a Division-Fusion-Optimization (DFO) framework, in which person and action recognition scores are obtained separately, then fused, and optionally further optimized to generate the final ranking. However, TRECVID evaluates only the final ranking results, ignoring the effects of the intermediate steps and their implementation methods. We argue that fine-grained evaluation of the intermediate steps of the DFO framework will (1) provide a quantitative analysis of different methods’ performance in the intermediate steps; (2) identify better design choices that improve retrieval performance; and (3) inspire new ideas for future research by analyzing the limitations of current techniques. In particular, we propose an indirect evaluation method motivated by the leave-one-out strategy, which finds an optimal solution surpassing the champion teams of the 2020–2021 INS tasks. Moreover, to validate the generalizability and robustness of the proposed solution under various scenarios, we construct a new large-scale P-A INS dataset and conduct comparative experiments with both the leading NIST TRECVID INS solution and the state-of-the-art P-A INS method. Finally, we discuss the limitations of our evaluation work and suggest future research directions.
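To make the fusion step of the DFO framework concrete, the following minimal Python sketch combines separately obtained person and action scores into a single ranking score. The function name, the min-max normalization, and the weighted-sum fusion rule are illustrative assumptions for this sketch, not the specific implementation used by any TRECVID team or by the solution evaluated in the article.

```python
import numpy as np

def fuse_person_action_scores(person_scores, action_scores, alpha=0.5):
    """Fuse per-shot person and action scores into one ranking score.

    person_scores, action_scores: sequences of the same length, one entry per
    candidate video shot. alpha weights the person branch. The min-max
    normalization and weighted sum are common late-fusion baselines chosen
    here for illustration only.
    """
    def min_max(x):
        x = np.asarray(x, dtype=float)
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    p = min_max(person_scores)
    a = min_max(action_scores)
    return alpha * p + (1.0 - alpha) * a

# Example: rank five candidate shots for the query "person X doing action Y".
person = [0.9, 0.2, 0.7, 0.4, 0.8]   # person-recognition branch scores
action = [0.1, 0.9, 0.6, 0.5, 0.7]   # action-recognition branch scores
fused = fuse_person_action_scores(person, action, alpha=0.6)
ranking = np.argsort(-fused)          # shot indices sorted by fused score
print(ranking)
```

The optional optimization step of DFO (e.g., re-ranking the fused list) would operate on the output of such a fusion function; the article's fine-grained evaluation compares alternative choices at each of these stages.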



Published In

ACM Transactions on Information Systems, Volume 42, Issue 2
March 2024, 897 pages
EISSN: 1558-2868
DOI: 10.1145/3618075

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 November 2023
Online AM: 29 August 2023
Accepted: 11 August 2023
Revised: 03 July 2023
Received: 26 October 2022
Published in TOIS Volume 42, Issue 2


Author Tags

  1. Movie video
  2. composite concepts
  3. person-action instance search

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
