Abstract
In image captioning, exploiting high-level semantic concepts is important for boosting captioning performance. Although much progress has been made in this direction, most existing captioning models neglect the interrelationships between objects in an image, which are key to accurately understanding image content. In this paper, we propose an object-aware semantic attention (OSA) based captioning model to address this issue. Specifically, our attention model builds explicit associations between objects by coupling the attention mechanism with three types of semantic concepts, i.e., category information, the relative sizes of objects, and the relative distances between objects. These concepts are easy to extract and couple seamlessly with the well-known encoder-decoder captioning framework. In our empirical analysis, the three concepts capture different aspects of image content, such as the number of objects in each category, the main focus of an image, and the closeness between objects. Importantly, they are combined with visual features to help the attention model effectively highlight the image regions of interest, yielding significant performance gains. By leveraging these three types of semantic concepts, we derive four semantic attention models for image captioning. Extensive experiments on the MSCOCO dataset show that our attention models, within the encoder-decoder captioning framework, perform favorably against representative captioning models.
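To make the idea concrete, the following is a minimal PyTorch sketch of how the three object-aware semantic concepts named in the abstract (category, relative size, relative distance) could be encoded per detected region and fused with visual features in an additive attention module. The authors' implementation is not reproduced here; all module names, dimensions, and the fusion scheme are illustrative assumptions.

```python
# Illustrative sketch only: not the authors' released code.
# Assumes region features from an object detector (e.g., 80 MSCOCO
# categories, 2048-d visual features); all dimensions are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectAwareSemanticAttention(nn.Module):
    def __init__(self, num_categories=80, vis_dim=2048, hid_dim=512):
        super().__init__()
        self.cat_embed = nn.Embedding(num_categories, 64)  # category concept
        # semantic vector = category embedding + relative size (1-d)
        # + relative distance (1-d)
        self.sem_proj = nn.Linear(64 + 2, hid_dim)
        self.vis_proj = nn.Linear(vis_dim, hid_dim)
        self.dec_proj = nn.Linear(hid_dim, hid_dim)
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, vis_feats, categories, rel_sizes, rel_dists, dec_state):
        # vis_feats:  (B, N, vis_dim) region features from the detector
        # categories: (B, N) long     detected class indices
        # rel_sizes:  (B, N)          box area / image area
        # rel_dists:  (B, N)          normalized distance between objects
        # dec_state:  (B, hid_dim)    current decoder hidden state
        sem = torch.cat([self.cat_embed(categories),
                         rel_sizes.unsqueeze(-1),
                         rel_dists.unsqueeze(-1)], dim=-1)
        # additive attention over regions, conditioned on visual features,
        # the object-aware semantic vector, and the decoder state
        e = self.score(torch.tanh(self.vis_proj(vis_feats)
                                  + self.sem_proj(sem)
                                  + self.dec_proj(dec_state).unsqueeze(1)))
        alpha = F.softmax(e.squeeze(-1), dim=-1)            # (B, N) weights
        context = (alpha.unsqueeze(-1) * vis_feats).sum(1)  # (B, vis_dim)
        return context, alpha

if __name__ == "__main__":
    B, N = 2, 5
    att = ObjectAwareSemanticAttention()
    ctx, alpha = att(torch.randn(B, N, 2048),
                     torch.randint(0, 80, (B, N)),
                     torch.rand(B, N), torch.rand(B, N),
                     torch.randn(B, 512))
    print(ctx.shape, alpha.shape)  # (2, 2048), (2, 5)
```

The returned context vector would feed the decoder at each time step, as in standard encoder-decoder captioning; variants that drop or recombine the three semantic terms would yield the different attention models the paper derives.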
Acknowledgements
This work was supported by the National Natural Science Foundation of China [61806213, U1435222].
About this article
Cite this article
Wang, S., Lan, L., Zhang, X. et al. Object-aware semantics of attention for image captioning. Multimed Tools Appl 79, 2013–2030 (2020). https://doi.org/10.1007/s11042-019-08209-5