Generating Textual Description Using Modified Beam Search

  • Conference paper
  • First Online:
Neural Information Processing (ICONIP 2022)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1792)

Abstract

Generating textual descriptions of images by describing them in words is a fundamental problem that connects computer vision and natural language processing. A single image may include several entities, with varying orientations, appearances, and positions in the scene, as well as complex spatial interactions among them, leading to many possible captions for a single image. The beam search algorithm has been employed for sentence generation over the last couple of decades, although it tends to return similar captions with only minor changes in wording. Another search strategy, Diverse M-Best, runs M independent beam searches (M denotes the number of diverse searches) from diverse starting statements, keeps the best output of each search, and discards the remaining (B-1) captions. This method usually yields many diverse generated sequences, but running beam search M times is computationally expensive. Building on the above work in vision, we devised and implemented a novel algorithm, Modified Beam Search (MBS), for generating diverse and better captions, with an increase in computational complexity compared to beam search. We obtained improvements of 1–3% in BLEU-3 and BLEU-4 scores over the top-2 captions predicted by the original beam search.
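
To make the contrast between these search strategies concrete, below is a minimal Python sketch, under stated assumptions, of standard beam search decoding and of the Diverse M-Best idea of running M independent searches and keeping only the best caption from each. The paper's Modified Beam Search itself is not specified in the abstract and is not reproduced here; the names step_fn, bos, and eos are hypothetical stand-ins for the decoder of a trained captioning model.

```python
# Minimal sketch of the two decoding strategies the abstract contrasts:
# standard beam search, and the Diverse M-Best idea of running M independent
# beam searches and keeping only the single best caption from each while
# discarding the remaining B-1 candidates. The paper's Modified Beam Search
# (MBS) is not specified in the abstract, so it is not reproduced here.
# `step_fn`, `bos`, and `eos` are hypothetical stand-ins for the decoder of a
# trained encoder-decoder captioning model.

from typing import Callable, Dict, List, Tuple

Token = str
StepFn = Callable[[List[Token]], Dict[Token, float]]  # prefix -> {next token: log-prob}


def beam_search(step_fn: StepFn, bos: Token, eos: Token,
                beam_width: int = 3, max_len: int = 20) -> List[Tuple[List[Token], float]]:
    """Return up to beam_width (caption, cumulative log-prob) pairs, best first."""
    beams = [([bos], 0.0)]
    finished: List[Tuple[List[Token], float]] = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, logp in step_fn(seq).items():
                candidates.append((seq + [tok], score + logp))
        # Keep only the beam_width highest-scoring extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # unfinished captions still count at max length
    return sorted(finished, key=lambda c: c[1], reverse=True)[:beam_width]


def diverse_m_best(step_fns: List[StepFn], bos: Token, eos: Token,
                   beam_width: int = 3) -> List[Tuple[List[Token], float]]:
    """Run one beam search per diverse starting decoder and keep only its top caption."""
    return [beam_search(fn, bos, eos, beam_width)[0] for fn in step_fns]


if __name__ == "__main__":
    # Toy scoring function standing in for a real decoder: it returns the same
    # next-token log-probabilities regardless of the prefix, just to exercise the search.
    vocab = {"a": -0.7, "dog": -1.2, "runs": -1.5, "<eos>": -0.9}
    toy_step: StepFn = lambda prefix: vocab
    for caption, logp in beam_search(toy_step, "<bos>", "<eos>", beam_width=2, max_len=4):
        print(" ".join(caption), round(logp, 2))
```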

Author information

Correspondence to Divyansh Rai or Arpit Agarwal.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Rai, D., Agarwal, A., Kumar, B., Vyas, O.P., Khan, S., Shourya, S. (2023). Generating Textual Description Using Modified Beam Search. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Communications in Computer and Information Science, vol 1792. Springer, Singapore. https://doi.org/10.1007/978-981-99-1642-9_12

Download citation

  • DOI: https://doi.org/10.1007/978-981-99-1642-9_12

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-1641-2

  • Online ISBN: 978-981-99-1642-9

  • eBook Packages: Computer Science, Computer Science (R0)
