Abstract
The Transformer architecture represents the state of the art in image captioning. However, even though the Transformer uses positional encodings to encode sentences, the grammar of its generated captions is still unsatisfactory. To improve image captioning performance, we present the Prior Language Knowledge Transformer (PLKT), a Transformer-based model that integrates learned prior language knowledge into image captioning. When our model predicts the next word, it relies not only on the previously generated sequence but also on prior language knowledge. To obtain this knowledge, we embed a learnable memory vector inside the self-attention. We also fine-tune the model with reinforcement learning during training. To demonstrate the effectiveness of PLKT, we compare our approach with other recent image captioning methods. In objective evaluation, our approach improves the CIDEr score of the baseline by 0.6 points on the “Karpathy” test split of the COCO2014 dataset. In subjective evaluation, the sentences generated by our approach are noticeably more grammatical than those of the baseline.
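The idea of a learnable memory inside self-attention can be illustrated with a minimal sketch. The abstract does not specify how the memory vector enters the attention computation, so the code below assumes one common realization: a few extra key/value slots appended to the keys and values, visible from every decoding step, while ordinary token positions remain causally masked. All names (`memory_self_attention`, `mem_k`, `mem_v`, `rand_matrix`) are hypothetical, and the random values stand in for parameters that would be learned during training; this is not the PLKT implementation.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row; -inf entries get weight 0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def memory_self_attention(x, w_q, w_k, w_v, mem_k, mem_v):
    """Single-head causal self-attention with extra memory slots.

    x: T rows of size d (token features); w_q, w_k, w_v: d x d projections;
    mem_k, mem_v: m rows of size d (ASSUMPTION: memory enters as extra
    key/value rows; the source only says a memory vector is embedded
    inside the self-attention).
    """
    t, d = len(x), len(x[0])
    q = matmul(x, w_q)
    k = matmul(x, w_k) + mem_k  # list concatenation: T + m key rows
    v = matmul(x, w_v) + mem_v  # T + m value rows
    weights = []
    for i in range(t):
        scores = []
        for j, key in enumerate(k):
            if i < j < t:
                # Future token: hidden from position i.
                scores.append(-math.inf)
            else:
                # Past/current token or memory slot: scaled dot product.
                scores.append(sum(a * b for a, b in zip(q[i], key)) / math.sqrt(d))
        weights.append(softmax(scores))
    return matmul(weights, v), weights


def rand_matrix(rows, cols, rng, scale=1.0):
    """Random stand-in for learned parameters or input features."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, D, M = 4, 8, 3  # sequence length, feature size, number of memory slots
x = rand_matrix(T, D, rng)
w_q, w_k, w_v = (rand_matrix(D, D, rng, 0.3) for _ in range(3))
mem_k, mem_v = rand_matrix(M, D, rng, 0.3), rand_matrix(M, D, rng, 0.3)
out, attn = memory_self_attention(x, w_q, w_k, w_v, mem_k, mem_v)
```

In this sketch each row of `attn` has `T + M` entries: the first `T` cover the generated tokens and the last `M` cover the memory slots. Because the memory slots do not depend on the input, they can carry knowledge that is independent of the current caption prefix, which is one plausible reading of "prior language knowledge".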
D. Yan and W. Yu—These authors have contributed equally to this work.
Acknowledgment
This research is supported by the Sichuan Science and Technology Program (No. 2020YFS0307, No. 2020YFG0430, No. 2019YFS0146) and the Mianyang Science and Technology Program (2020YFZJ016).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Yan, D., Yu, W., Zhang, Z., Gong, J. (2021). Transformer with Prior Language Knowledge for Image Captioning. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds) Neural Information Processing. ICONIP 2021. Lecture Notes in Computer Science, vol 13109. Springer, Cham. https://doi.org/10.1007/978-3-030-92270-2_4
DOI: https://doi.org/10.1007/978-3-030-92270-2_4
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-92269-6
Online ISBN: 978-3-030-92270-2
eBook Packages: Computer Science, Computer Science (R0)