Abstract
Generating textual descriptions of images is a fundamental problem that connects computer vision and natural language processing. A single image may contain several entities, each with its own orientation, appearance, and position in the scene, as well as complex spatial interactions among them, so many plausible captions exist for any given image. Beam Search has been the standard search algorithm for sentence generation for the last couple of decades, although it tends to return nearly identical captions that differ only in minor word choices. An alternative strategy, Diverse M-Best, runs M independent beam searches from diverse starting statements (M denotes the number of beam searches), keeps only the best output of each search, and discards the remaining B-1 captions. This approach usually yields many diverse generated sequences, but running Beam Search M times is computationally expensive. Building on these works, we devise and implement a novel algorithm, Modified Beam Search (MBS), which generates more diverse and better captions at the cost of some increase in computational complexity compared to standard Beam Search. We obtain improvements of 1-3% in BLEU-3 and BLEU-4 scores over the top-2 captions predicted by the original beam search.
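For illustration, the sketch below shows a minimal Python version of the standard beam search decoding loop and the Diverse M-Best selection described above; it is not the authors' implementation, and the `log_prob_next` function (token log-probabilities given a prefix) is a hypothetical stand-in for the caption model's decoder.

```python
# Minimal sketch (assumptions: a hypothetical `log_prob_next(prefix)` that
# yields (token, log-probability) pairs from the caption decoder).
import heapq

START, END = "<start>", "<end>"

def beam_search(log_prob_next, beam_width=3, max_len=20, prefix=(START,)):
    """Standard beam search: return completed captions ranked by total log-probability."""
    beams = [(0.0, list(prefix))]            # (cumulative log-prob, token sequence)
    completed = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == END:                # keep finished captions aside
                completed.append((score, seq))
                continue
            for token, logp in log_prob_next(seq):   # expand each live beam
                candidates.append((score + logp, seq + [token]))
        if not candidates:
            break
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    completed.extend(b for b in beams if b[1][-1] == END)
    return sorted(completed, key=lambda c: c[0], reverse=True)

def diverse_m_best(log_prob_next, m=3, beam_width=3, diverse_prefixes=None):
    """Diverse M-Best: run M independent beam searches from diverse starting
    prefixes and keep only the single best caption from each run."""
    diverse_prefixes = diverse_prefixes or [(START,)] * m
    best = []
    for prefix in diverse_prefixes[:m]:
        results = beam_search(log_prob_next, beam_width, prefix=prefix)
        if results:
            best.append(results[0])           # discard the other B-1 captions
    return best
```

As the sketch makes explicit, Diverse M-Best repeats the full decoding loop M times, which is the computational cost the Modified Beam Search proposed in this paper aims to reduce.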