
Person-action Instance Search in Story Videos: An Experimental Study

Published: 07 November 2023

Abstract

Person-Action instance search (P-A INS) aims to retrieve instances of a specific person performing a specific action, a task featured in the 2019–2021 INS tasks of the well-known TREC Video Retrieval Evaluation (TRECVID). Most top-ranking solutions can be summarized by a Division-Fusion-Optimization (DFO) framework, in which person and action recognition scores are obtained separately, then fused, and optionally further optimized to generate the final ranking. However, TRECVID evaluates only the final ranking results, ignoring the effects of the intermediate steps and their implementation methods. We argue that fine-grained evaluation of the intermediate steps of the DFO framework will (1) provide a quantitative analysis of different methods’ performance in the intermediate steps; (2) identify better design choices that improve retrieval performance; and (3) inspire new ideas for future research by analyzing the limitations of current techniques. In particular, we propose an indirect evaluation method motivated by the leave-one-out strategy, which finds an optimal solution surpassing the champion teams of the 2020–2021 INS tasks. Moreover, to validate the generalizability and robustness of the proposed solution under various scenarios, we construct a new large-scale P-A INS dataset and conduct comparative experiments with both the leading NIST TRECVID INS solution and the state-of-the-art P-A INS method. Finally, we discuss the limitations of our evaluation work and suggest future research directions.
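To make the fusion step of the DFO framework concrete, the following minimal Python sketch combines separately obtained person and action scores into a single ranking score. The function name, the min-max normalization, and the weighted-sum fusion rule are illustrative assumptions for this sketch, not the specific implementation used by any TRECVID team or by the solution evaluated in the article.

```python
import numpy as np

def fuse_person_action_scores(person_scores, action_scores, alpha=0.5):
    """Fuse per-shot person and action scores into one ranking score.

    person_scores, action_scores: sequences of the same length, one entry per
    candidate video shot. alpha weights the person branch. The min-max
    normalization and weighted sum are common late-fusion baselines chosen
    here for illustration only.
    """
    def min_max(x):
        x = np.asarray(x, dtype=float)
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    p = min_max(person_scores)
    a = min_max(action_scores)
    return alpha * p + (1.0 - alpha) * a

# Example: rank five candidate shots for the query "person X doing action Y".
person = [0.9, 0.2, 0.7, 0.4, 0.8]   # person-recognition branch scores
action = [0.1, 0.9, 0.6, 0.5, 0.7]   # action-recognition branch scores
fused = fuse_person_action_scores(person, action, alpha=0.6)
ranking = np.argsort(-fused)          # shot indices sorted by fused score
print(ranking)
```

The optional optimization step of DFO (e.g., re-ranking the fused list) would operate on the output of such a fusion function; the article's fine-grained evaluation compares alternative choices at each of these stages.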



Published In

ACM Transactions on Information Systems, Volume 42, Issue 2
March 2024, 897 pages
EISSN: 1558-2868
DOI: 10.1145/3618075

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 November 2023
Online AM: 29 August 2023
Accepted: 11 August 2023
Revised: 03 July 2023
Received: 26 October 2022
Published in TOIS Volume 42, Issue 2


Author Tags

  1. Movie video
  2. composite concepts
  3. person-action instance search

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
