Abstract
Generating textual descriptions of images is a fundamental problem that connects computer vision and natural language processing. A single image may contain several entities, each with its own orientation, appearance, and position in the scene, as well as complex spatial interactions among them, so many plausible captions exist for any given image. Beam Search has been the standard search algorithm for sentence generation for the last couple of decades, although it tends to return nearly identical captions that differ only in minor word choices. An alternative strategy, Diverse M-Best, runs M independent beam searches from diverse starting statements (M denotes the number of beam searches), keeps only the best output of each search, and discards the remaining B-1 captions. This approach usually yields many diverse generated sequences, but running Beam Search M times is computationally expensive. Building on these works, we devise and implement a novel algorithm, Modified Beam Search (MBS), which generates more diverse and better captions at the cost of some increase in computational complexity compared to standard Beam Search. We obtain improvements of 1-3% in BLEU-3 and BLEU-4 scores over the top-2 captions predicted by the original beam search.
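For illustration, the sketch below shows a minimal Python version of the standard beam search decoding loop and the Diverse M-Best selection described above; it is not the authors' implementation, and the `log_prob_next` function (token log-probabilities given a prefix) is a hypothetical stand-in for the caption model's decoder.

```python
# Minimal sketch (assumptions: a hypothetical `log_prob_next(prefix)` that
# yields (token, log-probability) pairs from the caption decoder).
import heapq

START, END = "<start>", "<end>"

def beam_search(log_prob_next, beam_width=3, max_len=20, prefix=(START,)):
    """Standard beam search: return completed captions ranked by total log-probability."""
    beams = [(0.0, list(prefix))]            # (cumulative log-prob, token sequence)
    completed = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == END:                # keep finished captions aside
                completed.append((score, seq))
                continue
            for token, logp in log_prob_next(seq):   # expand each live beam
                candidates.append((score + logp, seq + [token]))
        if not candidates:
            break
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    completed.extend(b for b in beams if b[1][-1] == END)
    return sorted(completed, key=lambda c: c[0], reverse=True)

def diverse_m_best(log_prob_next, m=3, beam_width=3, diverse_prefixes=None):
    """Diverse M-Best: run M independent beam searches from diverse starting
    prefixes and keep only the single best caption from each run."""
    diverse_prefixes = diverse_prefixes or [(START,)] * m
    best = []
    for prefix in diverse_prefixes[:m]:
        results = beam_search(log_prob_next, beam_width, prefix=prefix)
        if results:
            best.append(results[0])           # discard the other B-1 captions
    return best
```

As the sketch makes explicit, Diverse M-Best repeats the full decoding loop M times, which is the computational cost the Modified Beam Search proposed in this paper aims to reduce.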