
IcoCap: Improving Video Captioning by Compounding Images

Published: 05 October 2023

Abstract

Video captioning is a more challenging task than image captioning, primarily due to differences in content density. Video data contain redundant visual content, making it difficult for captioners to generalize diverse content and avoid being misled by irrelevant elements. Moreover, the redundant content is not well trimmed to match the corresponding visual semantics in the ground truth, further increasing the difficulty of video captioning. Current research in video captioning predominantly focuses on captioner design, neglecting the impact of content density on captioner performance. Given the differences between videos and images, there exists another way to improve video captioning: leveraging concise and easily learned image samples to further diversify video samples. This modification to content density compels the captioner to learn more effectively in the presence of redundancy and ambiguity. In this article, we propose a novel approach called Image-Compounded learning for video Captioners (IcoCap) to facilitate better learning of complex video semantics. IcoCap comprises two components: the Image-Video Compounding Strategy (ICS) and Visual-Semantic Guided Captioning (VGC). ICS compounds easily learned image semantics into video semantics, further diversifying video content and prompting the network to generalize over more diverse samples. In addition, by learning from samples compounded with image content, the captioner is compelled to better extract valuable video cues in the presence of straightforward image semantics, helping it focus on relevant information while filtering out extraneous content. VGC then guides the network to flexibly learn ground-truth captions based on the compounded samples, helping to mitigate the mismatch between the ground truth and the ambiguous semantics of video samples. Our experimental results demonstrate the effectiveness of IcoCap in improving the learning of video captioners. On the widely used MSVD, MSR-VTT, and VATEX datasets, our approach achieves competitive or superior results compared with state-of-the-art methods, illustrating its capacity to handle redundant and ambiguous video data.
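The abstract describes ICS only at a high level, so the following Python (PyTorch) snippet is a minimal, hypothetical sketch of the general idea of compounding image semantics into a video training sample; it is not the paper's actual implementation. The function name, the blend ratio alpha, and the random frame-selection rule are assumptions made purely for illustration.

# Hypothetical sketch of an image-video compounding step (NOT the paper's
# actual ICS): features of an easily learned image are blended into a subset
# of a video's frame features to diversify the training sample.
import torch

def compound_image_into_video(video_feats: torch.Tensor,
                              image_feat: torch.Tensor,
                              alpha: float = 0.5,
                              num_frames_to_mix: int = 4) -> torch.Tensor:
    """video_feats: (T, D) per-frame features from a visual encoder.
    image_feat:  (D,)  feature of a single image sample.
    Returns a compounded (T, D) feature sequence."""
    T, _ = video_feats.shape
    mixed = video_feats.clone()
    # Pick a random subset of frame positions to compound with image content.
    idx = torch.randperm(T)[:min(num_frames_to_mix, T)]
    # Convex blend keeps the feature scale comparable to the original frames.
    mixed[idx] = (1.0 - alpha) * mixed[idx] + alpha * image_feat
    return mixed

if __name__ == "__main__":
    video = torch.randn(16, 512)   # e.g., 16 frames of 512-dim features
    image = torch.randn(512)
    compounded = compound_image_into_video(video, image, alpha=0.5)
    print(compounded.shape)        # torch.Size([16, 512])

Under these assumptions, the compounded features would then be passed to the captioner, with the blend ratio controlling how much concise image semantics is injected into the redundant video content.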


Cited By

  • Sensor2Text: Enabling Natural Language Interactions for Daily Activity Tracking Using Wearable Sensors, Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 8, no. 4, pp. 1–26, Nov. 2024. DOI: 10.1145/3699747
  • Spiking Tucker Fusion Transformer for Audio-Visual Zero-Shot Learning, IEEE Transactions on Image Processing, vol. 33, pp. 4840–4852, 2024. DOI: 10.1109/TIP.2024.3430080
  • PosCap: Boosting Video Captioning with Part-of-Speech Guidance, Pattern Recognition and Computer Vision, pp. 430–444, Oct. 2024. DOI: 10.1007/978-981-97-8792-0_30


        Information & Contributors

        Information

        Published In

        cover image IEEE Transactions on Multimedia
        IEEE Transactions on Multimedia  Volume 26, Issue
        2024
        10405 pages

        Publisher

        IEEE Press

        Publication History

        Published: 05 October 2023

        Qualifiers

        • Research-article
