
Aligning Linguistic Words and Visual Semantic Units for Image Captioning

Published: 15 October 2019

Abstract

Image captioning aims to generate a sentence composed of linguistic words that describe the objects, attributes, and interactions in an image, which we denote as visual semantic units in this paper. Based on this view, we propose to explicitly model object interactions in semantics and geometry with Graph Convolutional Networks (GCNs), and to fully exploit the alignment between linguistic words and visual semantic units for image captioning. In particular, we construct a semantic graph and a geometry graph, where each node corresponds to a visual semantic unit, i.e., an object, an attribute, or a semantic (geometrical) interaction between two objects. Semantic (geometrical) context-aware embeddings for each unit are then obtained through the corresponding GCN learning process. At each time step, a context-gated attention module takes the embeddings of the visual semantic units as input and hierarchically aligns the current word with these units: it first decides which type of visual semantic unit (object, attribute, or interaction) the current word is about, and then finds the most correlated visual semantic units of that type. Extensive experiments are conducted on the challenging MS-COCO image captioning dataset, and superior results are reported in comparison with state-of-the-art approaches.
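
To make the two mechanisms in the abstract concrete, the sketch below gives a minimal, hypothetical PyTorch rendering of (a) a GCN layer producing context-aware embeddings for the visual semantic units and (b) a hierarchical, context-gated attention that first weighs unit types (object, attribute, interaction) and then attends within each type. This is not the authors' released code; the class names, tensor shapes, single-layer GCN, and additive attention form are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphConvLayer(nn.Module):
    """One GCN layer (Kipf & Welling style): each node aggregates features
    from its neighbors, yielding context-aware unit embeddings."""

    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x:   (N, dim)  node features, one node per visual semantic unit
        # adj: (N, N)    row-normalized adjacency of the semantic/geometry graph
        return F.relu(self.linear(adj @ x))


class ContextGatedAttention(nn.Module):
    """Hierarchical alignment: first decide which TYPE of unit the current
    word is about, then attend over the units of each type and mix the
    per-type contexts by the type weights."""

    def __init__(self, dim, n_types=3):
        super().__init__()
        self.type_gate = nn.Linear(dim, n_types)  # word -> type distribution
        self.attn = nn.Linear(2 * dim, 1)         # (word, unit) -> relevance

    def forward(self, word_h, unit_feats):
        # word_h:     (dim,)              decoder hidden state at this step
        # unit_feats: list of 3 tensors,  (N_t, dim) units of each type
        type_w = torch.softmax(self.type_gate(word_h), dim=-1)  # (n_types,)
        ctx = 0.0
        for t, feats in enumerate(unit_feats):
            q = word_h.expand_as(feats)                             # (N_t, dim)
            scores = self.attn(torch.cat([q, feats], -1)).squeeze(-1)
            alpha = torch.softmax(scores, dim=-1)                   # within-type
            ctx = ctx + type_w[t] * (alpha.unsqueeze(0) @ feats).squeeze(0)
        return ctx  # context vector fed to the caption decoder


if __name__ == "__main__":
    dim = 8
    gcn = GraphConvLayer(dim)
    attn = ContextGatedAttention(dim)
    # Toy "units": 4 objects, 3 attributes, 2 interactions
    units = [torch.randn(n, dim) for n in (4, 3, 2)]
    adj = torch.eye(4)                 # trivial graph just for the demo
    units[0] = gcn(units[0], adj)      # contextualize the object nodes
    word_h = torch.randn(dim)
    print(attn(word_h, units).shape)   # torch.Size([8])
```

The sketch fixes only the data flow implied by the abstract (GCN embeddings in, type gate, within-type attention out); the paper's exact parameterization of the context-gated module and the LSTM decoder it feeds are not reproduced here.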

Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN:9781450368896
DOI:10.1145/3343031
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. graph convolutional networks
  2. image captioning
  3. vision-language
  4. visual relationship

Qualifiers

  • Research-article

Conference

MM '19

Acceptance Rates

MM '19 Paper Acceptance Rate: 252 of 936 submissions, 27%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%

Cited By

  • (2025) EdgeScan for IoT Contextual Understanding With Edge Computing and Image Captioning. IEEE Internet of Things Journal 12(6), 6519-6535. DOI: 10.1109/JIOT.2024.3492066. Online publication date: 15-Mar-2025.
  • (2024) Insights into Object Semantics: Leveraging Transformer Networks for Advanced Image Captioning. Sensors 24(6), 1796. DOI: 10.3390/s24061796. Online publication date: 11-Mar-2024.
  • (2024) Toward Attribute-Controlled Fashion Image Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 20(9), 1-18. DOI: 10.1145/3671000. Online publication date: 23-Sep-2024.
  • (2024) A Survey of Knowledge Graph Reasoning on Graph Types: Static, Dynamic, and Multi-Modal. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(12), 9456-9478. DOI: 10.1109/TPAMI.2024.3417451. Online publication date: Dec-2024.
  • (2024) Adaptive Semantic-Enhanced Transformer for Image Captioning. IEEE Transactions on Neural Networks and Learning Systems 35(2), 1785-1796. DOI: 10.1109/TNNLS.2022.3185320. Online publication date: Feb-2024.
  • (2024) Color Enhanced Cross Correlation Net for Image Sentiment Analysis. IEEE Transactions on Multimedia 26, 4097-4109. DOI: 10.1109/TMM.2021.3118208. Online publication date: 1-Jan-2024.
  • (2024) A Regionally Indicated Visual Grounding Network for Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 62, 1-11. DOI: 10.1109/TGRS.2024.3490847. Online publication date: 2024.
  • (2024) Understanding remote sensing imagery like reading a text document: What can remote sensing image captioning offer? International Journal of Applied Earth Observation and Geoinformation 131, 103939. DOI: 10.1016/j.jag.2024.103939. Online publication date: Jul-2024.
  • (2024) Sentinel mechanism for visual semantic graph-based image captioning. Computers and Electrical Engineering 119, 109626. DOI: 10.1016/j.compeleceng.2024.109626. Online publication date: Nov-2024.
  • (2024) ClKI: closed-loop and knowledge iterative via self-distillation for image sentiment analysis. International Journal of Machine Learning and Cybernetics 15(7), 2843-2862. DOI: 10.1007/s13042-023-02068-1. Online publication date: 16-Jan-2024.
