Action-aware Linguistic Skeleton Optimization Network for Non-autoregressive Video Captioning

Published: 30 October 2024

Abstract

Non-autoregressive video captioning methods generate visual words in parallel but often overlook the semantic correlations among them, especially for verbs, which lowers caption quality. To address this, we integrate action information about highlighted objects to strengthen the semantic connections among visual words. Our proposed Action-aware Linguistic Skeleton Optimization Network (ALSO-Net) tackles the challenge of extracting action information across frames, improving the understanding of complex, context-dependent video actions and reducing sentence inconsistencies. ALSO-Net incorporates a linguistic skeleton tag generator to refine semantic correlations and a video action predictor to improve verb prediction accuracy in the generated captions. We further address unsatisfactory caption length and quality by jointly optimizing motion prediction losses at different levels. Experimental evaluation on prominent video captioning datasets demonstrates that ALSO-Net outperforms baseline methods by a significant margin and achieves performance competitive with state-of-the-art autoregressive methods at lower model complexity and faster inference.
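
To make the abstract's two central ideas concrete, the sketch below shows (1) a decoder that emits all caption tokens in one parallel pass, with no causal mask, and (2) a joint loss coupling the caption with a video-level action (verb) prediction. This is a minimal, hypothetical PyTorch sketch; all module names, dimensions, the mean-pooled action head, and the loss weight `alpha` are illustrative assumptions, not the authors' actual ALSO-Net implementation.

```python
# Hypothetical sketch of non-autoregressive captioning with an auxiliary
# action-prediction loss. Names and dimensions are assumptions for
# illustration, not the paper's implementation.
import torch
import torch.nn as nn

class NonAutoregressiveCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, num_actions=400, d_model=512,
                 max_len=20, num_layers=3, num_heads=8):
        super().__init__()
        self.max_len = max_len
        # Learned positional queries: one query per caption position.
        self.pos_embed = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.word_head = nn.Linear(d_model, vocab_size)      # per-position word logits
        self.action_head = nn.Linear(d_model, num_actions)   # video-level verb logits

    def forward(self, video_feats):
        # video_feats: (B, T, d_model) frame features, e.g. from a 3D CNN.
        B = video_feats.size(0)
        positions = torch.arange(self.max_len, device=video_feats.device)
        queries = self.pos_embed(positions).unsqueeze(0).expand(B, -1, -1)
        # No causal mask: every caption position attends to the video and to
        # each other at once, so all words are produced in a single pass.
        hidden = self.decoder(queries, video_feats)
        word_logits = self.word_head(hidden)                       # (B, max_len, vocab)
        action_logits = self.action_head(video_feats.mean(dim=1))  # (B, num_actions)
        return word_logits, action_logits

def joint_loss(word_logits, action_logits, caption_ids, action_ids, alpha=0.5):
    # Caption cross-entropy over all positions plus a weighted auxiliary
    # action-classification loss; alpha is an assumed hyperparameter.
    cap_loss = nn.functional.cross_entropy(
        word_logits.flatten(0, 1), caption_ids.flatten())
    act_loss = nn.functional.cross_entropy(action_logits, action_ids)
    return cap_loss + alpha * act_loss

# Usage with dummy tensors:
model = NonAutoregressiveCaptioner()
feats = torch.randn(2, 16, 512)                       # 2 clips, 16 frames each
words, actions = model(feats)
loss = joint_loss(words, actions,
                  torch.randint(0, 10000, (2, 20)),   # dummy caption token ids
                  torch.randint(0, 400, (2,)))        # dummy action labels
loss.backward()
```

A production non-autoregressive captioner would also predict the caption length from the video features and typically refines the parallel output over a few iterations, as in mask-predict style decoding; the fixed `max_len` here is purely for brevity.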


Cited By

  • (2024) An Empirical Study on Sentiment Intensity Analysis via Reading Comprehension Models. In Proceedings of the 1st ACM Multimedia Workshop on Multi-modal Misinformation Governance in the Era of Foundation Models, 23–28. DOI: 10.1145/3689090.3689390. Online publication date: 28 Oct 2024.
  • (2024) Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 229–239. DOI: 10.1145/3626772.3657727. Online publication date: 10 Jul 2024.
  • (2024) GainNet: Coordinates the Odd Couple of Generative AI and 6G Networks. IEEE Network: The Magazine of Global Internetworking 38, 5, 56–65. DOI: 10.1109/MNET.2024.3418671. Online publication date: 24 Jun 2024.


Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 10
October 2024
729 pages
EISSN: 1551-6865
DOI: 10.1145/3613707
  • Editor: Abdulmotaleb El Saddik

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 October 2024
Online AM: 20 July 2024
Accepted: 13 July 2024
Revised: 06 May 2024
Received: 20 August 2023
Published in TOMM Volume 20, Issue 10

Author Tags

  1. Video Captioning
  2. Non-Autoregressive Models
  3. Visual-Language Alignment
  4. Video Action Prediction
  5. Semantic Dependencies

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • Natural Science Foundation of Hubei Province
  • Hubei Institute of Education Science
  • Scientific Research Foundation of Hubei University of Education for Talent Introduction
  • Fundamental Research Funds for the Central Universities
  • Hubei Provincial Collaborative Innovation Center for Basic Education Information Technology Services


