Abstract
Machine learning methods are widely used to generate and process descriptive text for images and video frames, an area that has attracted immense research interest in recent years. For text generation, most models combine a convolutional neural network (CNN) with a recurrent neural network (RNN). Although an RNN works well for language modeling, it struggles to retain information over long sequences; an LSTM language model overcomes this drawback through its handling of long-term dependencies. The proposed methodology is an encoder-decoder approach in which a VGG19 convolutional neural network serves as the encoder and an LSTM language model serves as the decoder that generates the sentence. The model is trained and tested on the Flickr8K dataset and can generate textual descriptions on the larger Flickr30K dataset with only slight modifications. Results are evaluated with BLEU (Bilingual Evaluation Understudy) scores. A GUI tool is developed to support child education: it generates audio for the textual descriptions produced for images and helps search for similar content on the internet.
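For illustration, the encoder-decoder described above can be sketched in Keras. This is a minimal sketch under assumed settings: the vocabulary size, caption length, and 256-unit layer widths are illustrative choices, not the paper's reported configuration.

```python
from tensorflow.keras.applications import VGG19
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 8000  # assumed tokenizer vocabulary for Flickr8K captions
max_len = 34       # assumed maximum caption length in tokens

# Encoder: VGG19 with its classifier removed; the 4096-d fc2 activations
# act as the image feature vector (typically precomputed once per image).
base = VGG19(weights="imagenet")
encoder = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

# Decoder: an LSTM language model conditioned on the image features,
# predicting the next word of the caption at each step.
img_in = Input(shape=(4096,))
img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

seq_in = Input(shape=(max_len,))
seq_emb = Embedding(vocab_size, 256, mask_zero=True)(seq_in)
seq_vec = LSTM(256)(Dropout(0.5)(seq_emb))

merged = Dense(256, activation="relu")(add([img_vec, seq_vec]))
output = Dense(vocab_size, activation="softmax")(merged)

caption_model = Model(inputs=[img_in, seq_in], outputs=output)
caption_model.compile(loss="categorical_crossentropy", optimizer="adam")
```

At inference time the caption is produced word by word: the image features and the partial sequence are fed back into the decoder until an end-of-sequence token is emitted.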
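The BLEU evaluation mentioned in the abstract can be reproduced with NLTK, assuming each test image has several tokenized reference captions (Flickr8K provides five per image) and one generated hypothesis; the captions below are invented placeholders.

```python
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu

# One entry per test image: a list of reference captions and one hypothesis.
references = [
    [["a", "dog", "runs", "through", "the", "grass"],
     ["a", "brown", "dog", "is", "running", "outside"]],
]
hypotheses = [["a", "dog", "is", "running", "in", "the", "grass"]]

smooth = SmoothingFunction().method1  # avoids zero scores on tiny toy data
for n in range(1, 5):
    # Cumulative n-gram weights for BLEU-1 through BLEU-4.
    weights = tuple(1.0 / n for _ in range(n)) + (0.0,) * (4 - n)
    score = corpus_bleu(references, hypotheses, weights=weights,
                        smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```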
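The GUI tool's audio and search features can likewise be sketched. The abstract does not name the libraries used, so gTTS for text-to-speech and a plain web search are assumptions made here for illustration.

```python
import webbrowser
from urllib.parse import quote_plus

from gtts import gTTS  # assumed text-to-speech backend; not named in the paper

caption = "a dog is running in the grass"  # caption produced by the model

# Audio step: synthesize the generated caption to an MP3 for playback.
gTTS(text=caption, lang="en").save("caption_audio.mp3")

# Search step: open a web search for content similar to the description.
webbrowser.open("https://www.google.com/search?q=" + quote_plus(caption))
```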
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Garg, K., Singh, V., Tiwary, U.S. (2022). Textual Description Generation for Visual Content Using Neural Networks. In: Kim, JH., Singh, M., Khan, J., Tiwary, U.S., Sur, M., Singh, D. (eds) Intelligent Human Computer Interaction. IHCI 2021. Lecture Notes in Computer Science, vol 13184. Springer, Cham. https://doi.org/10.1007/978-3-030-98404-5_2
DOI: https://doi.org/10.1007/978-3-030-98404-5_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-98403-8
Online ISBN: 978-3-030-98404-5
eBook Packages: Computer Science (R0)