
Look Deeper See Richer: Depth-aware Image Paragraph Captioning

Published: 15 October 2018

Abstract

With image captioning at the sentence level now widely available, how to automatically generate image paragraphs is not yet well explored. Describing an image with a full paragraph involves organising sentences orderly, coherently and diversely, inevitably leading to higher complexity than describing it with a single sentence. Existing image paragraph captioning methods produce a series of sentences to represent the objects and regions of interest, where the descriptions are essentially generated by feeding the image fragments containing those objects and regions into conventional single-sentence image captioning models. With this strategy, it is difficult to generate descriptions that guarantee a stereoscopic hierarchy and non-overlapping objects. In this paper, we propose a Depth-aware Attention Model (DAM) to generate paragraph captions for images. The depths of image areas are first estimated in order to discriminate objects across a range of spatial locations, which can further guide the linguistic decoder to reveal spatial relationships among objects. This model completes the paragraph in a logical and coherent manner. By incorporating an attention mechanism, the learned model swiftly shifts the sentence focus during paragraph generation whilst avoiding verbose descriptions of the same object. Extensive quantitative experiments and a user study conducted on the Visual Genome dataset demonstrate the effectiveness and interpretability of the proposed model.
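
To make the mechanism concrete, the sketch below shows one plausible way depth estimates could be fused into the attention step: each region's visual feature is concatenated with its estimated depth before attention weights are computed, so the decoder's hidden state can select regions by depth as well as by appearance. This is an illustrative sketch in PyTorch; the class name DepthAwareAttention, the concatenation-based fusion, and all dimensions are our assumptions, not the paper's exact architecture.

```python
# A minimal sketch (not the authors' exact architecture) of depth-aware
# attention, assuming PyTorch. The scalar-depth concatenation and all
# dimensions below are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthAwareAttention(nn.Module):
    """Attend over image regions whose visual features are augmented with
    an estimated depth value, so the decoder can distinguish objects at
    different spatial depths while generating each sentence."""
    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        # +1 because each region feature is concatenated with its scalar depth
        self.feat_proj = nn.Linear(feat_dim + 1, attn_dim)
        self.state_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions, depths, h):
        # regions: (B, R, feat_dim); depths: (B, R); h: (B, hidden_dim)
        feats = torch.cat([regions, depths.unsqueeze(-1)], dim=-1)
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.state_proj(h).unsqueeze(1)))
        alpha = F.softmax(e.squeeze(-1), dim=-1)            # weights over regions
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # depth-augmented context
        return context, alpha

# Toy usage: 2 images, 5 regions each, 2048-d CNN features; in practice the
# depths would come from a monocular depth estimator rather than random noise.
attn = DepthAwareAttention(feat_dim=2048, hidden_dim=512, attn_dim=256)
regions = torch.randn(2, 5, 2048)
depths = torch.rand(2, 5)        # normalised per-region depth estimates
h = torch.zeros(2, 512)          # current decoder hidden state
context, alpha = attn(regions, depths, h)
print(context.shape, alpha.shape)  # torch.Size([2, 2049]) torch.Size([2, 5])
```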

Published In

MM '18: Proceedings of the 26th ACM international conference on Multimedia
October 2018
2167 pages
ISBN: 9781450356657
DOI: 10.1145/3240508
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. attention mechanism
  2. depth estimation
  3. paragraph captioning

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China (NSFC)
  • Australian Research Council (ARC)

Conference

MM '18: ACM Multimedia Conference
October 22-26, 2018
Seoul, Republic of Korea

Acceptance Rates

MM '18 Paper Acceptance Rate: 209 of 757 submissions, 28%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (last 12 months): 34
  • Downloads (last 6 weeks): 1

Reflects downloads up to 26 Dec 2024

Cited By

  • (2024) Diverse Image Captioning via Panoptic Segmentation and Sequential Conditional Variational Transformer. ACM Transactions on Multimedia Computing, Communications, and Applications 20(12), 1-17. DOI: 10.1145/3695878. Online publication date: 17-Sep-2024.
  • (2024) Look and Review, Then Tell: Generate More Coherent Paragraphs from Images by Fusing Visual and Textual Information. 2024 International Joint Conference on Neural Networks (IJCNN), 1-8. DOI: 10.1109/IJCNN60899.2024.10650474. Online publication date: 30-Jun-2024.
  • (2024) Image paragraph captioning with topic clustering and topic shift prediction. Knowledge-Based Systems 286, 111401. DOI: 10.1016/j.knosys.2024.111401. Online publication date: Feb-2024.
  • (2023) Visual Paragraph Generation: Review. 2023 International Conference on IT Innovation and Knowledge Discovery (ITIKD), 1-6. DOI: 10.1109/ITIKD56332.2023.10099830. Online publication date: 8-Mar-2023.
  • (2022) HybridVocab: Towards Multi-Modal Machine Translation via Multi-Aspect Alignment. Proceedings of the 2022 International Conference on Multimedia Retrieval, 380-388. DOI: 10.1145/3512527.3531386. Online publication date: 27-Jun-2022.
  • (2022) Effective Multimodal Encoding for Image Paragraph Captioning. IEEE Transactions on Image Processing 31, 6381-6395. DOI: 10.1109/TIP.2022.3211467. Online publication date: 2022.
  • (2022) AAP-MIT: Attentive Atrous Pyramid Network and Memory Incorporated Transformer for Multisentence Video Description. IEEE Transactions on Image Processing 31, 5559-5569. DOI: 10.1109/TIP.2022.3195643. Online publication date: 2022.
  • (2022) Improving Image Paragraph Captioning with Dual Relations. 2022 IEEE International Conference on Multimedia and Expo (ICME), 1-6. DOI: 10.1109/ICME52920.2022.9859701. Online publication date: 18-Jul-2022.
  • (2022) Image Captioning State-of-the-Art: Is It Enough for the Guidance of Visually Impaired in an Environment? Advances in Computing Systems and Applications, 385-394. DOI: 10.1007/978-3-031-12097-8_33. Online publication date: 28-Sep-2022.
  • (2021) Multi-Perspective Video Captioning. Proceedings of the 29th ACM International Conference on Multimedia, 5110-5118. DOI: 10.1145/3474085.3475173. Online publication date: 17-Oct-2021.
