
Boosting Semi-Supervised Video Captioning via Learning Candidates Adjusters

Online AM: 11 July 2024

Abstract

Video captioning is a multimodal task spanning computer vision and natural language processing, whose goal is to automatically describe video content with natural language sentences. Although large amounts of video data are available, videos annotated with description sentences are scarce. In this paper, we define the semi-supervised video captioning (SSVC) problem, which aims to improve captioning performance under limited annotation by leveraging semantic knowledge from both well-annotated and unannotated samples. To address this problem, we introduce an LCA-boosted model (LCABM) that first employs a learnable candidates adjuster (LCA) to refine the caption candidates and then treats the adjusted captions as pseudo labels to train the SSVC model on the unannotated samples in turn. In particular, model learning is formulated as a bi-level optimization problem and solved with an EM-like multi-stage training algorithm. Experiments demonstrate the effectiveness of the proposed LCABM, whose performance is comparable to, and in some cases better than, that of state-of-the-art fully supervised methods while using fewer annotations.
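To make the alternating scheme concrete, below is a minimal Python sketch of an EM-like multi-stage training loop with a candidates adjuster, reconstructed from the abstract alone. All class and function names (Video, Captioner, CandidatesAdjuster, train_lcabm) are hypothetical placeholders rather than the authors' actual API, and the bi-level optimization is reduced to a simple alternation for illustration.

from dataclasses import dataclass
from typing import List, Tuple

# Toy stand-ins; the actual LCABM uses neural video encoders and a caption decoder.
@dataclass(frozen=True)
class Video:
    vid: str

class Captioner:
    """Placeholder for the base video captioning model."""
    def train(self, pairs: List[Tuple[Video, str]]) -> None:
        pass  # supervised training (e.g. cross-entropy on video-caption pairs)

    def generate_candidates(self, video: Video, k: int = 5) -> List[str]:
        # In practice, candidates would come from beam search or sampling.
        return [f"candidate {i} for {video.vid}" for i in range(k)]

class CandidatesAdjuster:
    """Placeholder for the learnable candidates adjuster (LCA)."""
    def adjust(self, video: Video, candidates: List[str]) -> str:
        return candidates[0]  # a real LCA would rescore/refine the candidate set

    def update(self, captioner: Captioner, pairs: List[Tuple[Video, str]]) -> None:
        pass  # learn to prefer candidates that match ground-truth captions

def train_lcabm(labeled: List[Tuple[Video, str]],
                unlabeled: List[Video],
                num_stages: int = 3):
    captioner, adjuster = Captioner(), CandidatesAdjuster()
    captioner.train(labeled)  # warm up on the small annotated subset
    for _ in range(num_stages):
        # E-like step: generate caption candidates for unannotated videos and
        # let the adjuster turn them into pseudo labels.
        pseudo = [(v, adjuster.adjust(v, captioner.generate_candidates(v)))
                  for v in unlabeled]
        # M-like step: retrain the captioner on annotated plus pseudo-labeled
        # data, then update the adjuster against the annotated pairs.
        captioner.train(labeled + pseudo)
        adjuster.update(captioner, labeled)
    return captioner, adjuster

# Toy usage
captioner, adjuster = train_lcabm(
    labeled=[(Video("v1"), "a man is playing guitar")],
    unlabeled=[Video("v2"), Video("v3")])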

      Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Just Accepted
EISSN: 1551-6865
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Online AM: 11 July 2024
      Revised: 10 July 2024
      Accepted: 25 February 2024
      Received: 05 August 2023

      Author Tags

      1. Video Captioning
      2. Semi-Supervision
      3. Candidates Adjuster
      4. Transformer
      5. Multi-Stage Training

      Qualifiers

      • Research-article
