
Aligning Linguistic Words and Visual Semantic Units for Image Captioning

Published: 15 October 2019

Abstract

Image captioning aims to generate a sentence composed of linguistic words that describe the objects, attributes, and interactions in an image, which we denote as visual semantic units in this paper. Based on this view, we propose to explicitly model object interactions in semantics and geometry with Graph Convolutional Networks (GCNs), and to fully exploit the alignment between linguistic words and visual semantic units for image captioning. In particular, we construct a semantic graph and a geometry graph, where each node corresponds to a visual semantic unit, i.e., an object, an attribute, or a semantic (geometrical) interaction between two objects. Semantic (geometrical) context-aware embeddings for each unit are then obtained through the corresponding GCN learning process. At each time step, a context-gated attention module takes the embeddings of the visual semantic units as input and hierarchically aligns the current word with these units: it first decides which type of visual semantic unit (object, attribute, or interaction) the current word is about, and then finds the most correlated visual semantic units of that type. Extensive experiments are conducted on the challenging MS-COCO image captioning dataset, and superior results are reported in comparison with state-of-the-art approaches.
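
To make the two mechanisms in the abstract concrete, the sketch below gives a minimal, hypothetical PyTorch rendering of (a) a GCN layer producing context-aware embeddings for the visual semantic units and (b) a hierarchical, context-gated attention that first weighs unit types (object, attribute, interaction) and then attends within each type. This is not the authors' released code; the class names, tensor shapes, single-layer GCN, and additive attention form are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphConvLayer(nn.Module):
    """One GCN layer (Kipf & Welling style): each node aggregates features
    from its neighbors, yielding context-aware unit embeddings."""

    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x:   (N, dim)  node features, one node per visual semantic unit
        # adj: (N, N)    row-normalized adjacency of the semantic/geometry graph
        return F.relu(self.linear(adj @ x))


class ContextGatedAttention(nn.Module):
    """Hierarchical alignment: first decide which TYPE of unit the current
    word is about, then attend over the units of each type and mix the
    per-type contexts by the type weights."""

    def __init__(self, dim, n_types=3):
        super().__init__()
        self.type_gate = nn.Linear(dim, n_types)  # word -> type distribution
        self.attn = nn.Linear(2 * dim, 1)         # (word, unit) -> relevance

    def forward(self, word_h, unit_feats):
        # word_h:     (dim,)              decoder hidden state at this step
        # unit_feats: list of 3 tensors,  (N_t, dim) units of each type
        type_w = torch.softmax(self.type_gate(word_h), dim=-1)  # (n_types,)
        ctx = 0.0
        for t, feats in enumerate(unit_feats):
            q = word_h.expand_as(feats)                             # (N_t, dim)
            scores = self.attn(torch.cat([q, feats], -1)).squeeze(-1)
            alpha = torch.softmax(scores, dim=-1)                   # within-type
            ctx = ctx + type_w[t] * (alpha.unsqueeze(0) @ feats).squeeze(0)
        return ctx  # context vector fed to the caption decoder


if __name__ == "__main__":
    dim = 8
    gcn = GraphConvLayer(dim)
    attn = ContextGatedAttention(dim)
    # Toy "units": 4 objects, 3 attributes, 2 interactions
    units = [torch.randn(n, dim) for n in (4, 3, 2)]
    adj = torch.eye(4)                 # trivial graph just for the demo
    units[0] = gcn(units[0], adj)      # contextualize the object nodes
    word_h = torch.randn(dim)
    print(attn(word_h, units).shape)   # torch.Size([8])
```

The sketch fixes only the data flow implied by the abstract (GCN embeddings in, type gate, within-type attention out); the paper's exact parameterization of the context-gated module and the LSTM decoder it feeds are not reproduced here.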

Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN:9781450368896
DOI:10.1145/3343031
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. graph convolutional networks
  2. image captioning
  3. vision-language
  4. visual relationship

Qualifiers

  • Research-article

Conference

MM '19

Acceptance Rates

MM '19 Paper Acceptance Rate: 252 of 936 submissions, 27%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%

Cited By

  • (2025) EdgeScan for IoT Contextual Understanding With Edge Computing and Image Captioning. IEEE Internet of Things Journal 12(6), 6519-6535. DOI: 10.1109/JIOT.2024.3492066. Online publication date: 15-Mar-2025.
  • (2024) Insights into Object Semantics: Leveraging Transformer Networks for Advanced Image Captioning. Sensors 24(6), 1796. DOI: 10.3390/s24061796. Online publication date: 11-Mar-2024.
  • (2024) Toward Attribute-Controlled Fashion Image Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 20(9), 1-18. DOI: 10.1145/3671000. Online publication date: 23-Sep-2024.
  • (2024) A Survey of Knowledge Graph Reasoning on Graph Types: Static, Dynamic, and Multi-Modal. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(12), 9456-9478. DOI: 10.1109/TPAMI.2024.3417451. Online publication date: Dec-2024.
  • (2024) Adaptive Semantic-Enhanced Transformer for Image Captioning. IEEE Transactions on Neural Networks and Learning Systems 35(2), 1785-1796. DOI: 10.1109/TNNLS.2022.3185320. Online publication date: Feb-2024.
  • (2024) Color Enhanced Cross Correlation Net for Image Sentiment Analysis. IEEE Transactions on Multimedia 26, 4097-4109. DOI: 10.1109/TMM.2021.3118208. Online publication date: 1-Jan-2024.
  • (2024) A Regionally Indicated Visual Grounding Network for Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 62, 1-11. DOI: 10.1109/TGRS.2024.3490847. Online publication date: 2024.
  • (2024) Understanding remote sensing imagery like reading a text document: What can remote sensing image captioning offer? International Journal of Applied Earth Observation and Geoinformation 131, 103939. DOI: 10.1016/j.jag.2024.103939. Online publication date: Jul-2024.
  • (2024) Sentinel mechanism for visual semantic graph-based image captioning. Computers and Electrical Engineering 119, 109626. DOI: 10.1016/j.compeleceng.2024.109626. Online publication date: Nov-2024.
  • (2024) ClKI: closed-loop and knowledge iterative via self-distillation for image sentiment analysis. International Journal of Machine Learning and Cybernetics 15(7), 2843-2862. DOI: 10.1007/s13042-023-02068-1. Online publication date: 16-Jan-2024.
