Textual Description Generation for Visual Content Using Neural Networks

  • Conference paper
Intelligent Human Computer Interaction (IHCI 2021)

Abstract

Machine learning methods are widely used to generate descriptive text for images and video frames and to process them. This area has attracted considerable research interest in recent years. For text generation, many models combine a CNN with an RNN. An RNN works well for language modeling but struggles to retain information over long sequences. An LSTM language model can overcome this drawback because it handles long-term dependencies. The proposed methodology is an encoder-decoder approach in which a VGG19 convolutional neural network serves as the encoder and an LSTM language model serves as the decoder that generates the sentence. The model is trained and tested on the Flickr8K dataset and, with only slight modifications, can generate textual descriptions on the larger Flickr30K dataset. The results are evaluated using BLEU (Bilingual Evaluation Understudy) scores. A GUI tool is also developed for use in child education: it generates audio for the textual description produced for an image and helps to search for similar content on the internet.
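
The abstract names BLEU as the evaluation metric without showing how such a score is computed. The following is a minimal sketch of sentence-level BLEU as defined by Papineni et al., assuming whitespace-tokenized captions, uniform n-gram weights, and no smoothing. The function names and the example captions are illustrative, and the evaluation settings actually used in the paper are not stated in the abstract.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram count is capped by
    the largest count of that n-gram in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def sentence_bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing
    (assumption: the score is 0 if any n-gram precision is 0)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: compare against the reference length closest to
    # the candidate length (ties go to the shorter reference).
    c = len(candidate)
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


# Hypothetical generated caption and human reference captions.
generated = "a dog runs on the grass".split()
human = ["a dog runs on the grass".split(),
         "a brown dog is running outside".split()]
print(sentence_bleu(generated, human))
```

Captioning papers commonly report BLEU-1 through BLEU-4, which in this sketch corresponds to calling `sentence_bleu` with `max_n` set to 1, 2, 3, and 4. Short captions often have no matching 4-grams, so practical evaluations usually aggregate counts over the whole test set or apply smoothing rather than averaging unsmoothed sentence scores.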

Author information

Corresponding author

Correspondence to Komal Garg.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Garg, K., Singh, V., Tiwary, U.S. (2022). Textual Description Generation for Visual Content Using Neural Networks. In: Kim, JH., Singh, M., Khan, J., Tiwary, U.S., Sur, M., Singh, D. (eds) Intelligent Human Computer Interaction. IHCI 2021. Lecture Notes in Computer Science, vol 13184. Springer, Cham. https://doi.org/10.1007/978-3-030-98404-5_2

  • DOI: https://doi.org/10.1007/978-3-030-98404-5_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-98403-8

  • Online ISBN: 978-3-030-98404-5

  • eBook Packages: Computer Science, Computer Science (R0)
