Abstract
Machine learning methods are widely used to generate and process descriptive text for images and video frames, an area that has attracted immense research interest in recent years. For text generation, most models combine a convolutional neural network (CNN) with a recurrent neural network (RNN). Although an RNN works well for language modeling, it struggles to retain information over long sequences; an LSTM language model overcomes this drawback through its handling of long-term dependencies. The proposed methodology is an encoder-decoder approach in which a VGG19 convolutional neural network serves as the encoder and an LSTM language model serves as the decoder that generates the sentence. The model is trained and tested on the Flickr8K dataset and can generate textual descriptions on the larger Flickr30K dataset with only slight modifications. Results are evaluated with BLEU (Bilingual Evaluation Understudy) scores. A GUI tool is developed to support child education: it generates audio for the textual descriptions produced for images and helps search for similar content on the internet.
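For illustration, the encoder-decoder described above can be sketched in Keras. This is a minimal sketch under assumed settings: the vocabulary size, caption length, and 256-unit layer widths are illustrative choices, not the paper's reported configuration.

```python
from tensorflow.keras.applications import VGG19
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 8000  # assumed tokenizer vocabulary for Flickr8K captions
max_len = 34       # assumed maximum caption length in tokens

# Encoder: VGG19 with its classifier removed; the 4096-d fc2 activations
# act as the image feature vector (typically precomputed once per image).
base = VGG19(weights="imagenet")
encoder = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

# Decoder: an LSTM language model conditioned on the image features,
# predicting the next word of the caption at each step.
img_in = Input(shape=(4096,))
img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

seq_in = Input(shape=(max_len,))
seq_emb = Embedding(vocab_size, 256, mask_zero=True)(seq_in)
seq_vec = LSTM(256)(Dropout(0.5)(seq_emb))

merged = Dense(256, activation="relu")(add([img_vec, seq_vec]))
output = Dense(vocab_size, activation="softmax")(merged)

caption_model = Model(inputs=[img_in, seq_in], outputs=output)
caption_model.compile(loss="categorical_crossentropy", optimizer="adam")
```

At inference time the caption is produced word by word: the image features and the partial sequence are fed back into the decoder until an end-of-sequence token is emitted.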
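The BLEU evaluation mentioned in the abstract can be reproduced with NLTK, assuming each test image has several tokenized reference captions (Flickr8K provides five per image) and one generated hypothesis; the captions below are invented placeholders.

```python
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu

# One entry per test image: a list of reference captions and one hypothesis.
references = [
    [["a", "dog", "runs", "through", "the", "grass"],
     ["a", "brown", "dog", "is", "running", "outside"]],
]
hypotheses = [["a", "dog", "is", "running", "in", "the", "grass"]]

smooth = SmoothingFunction().method1  # avoids zero scores on tiny toy data
for n in range(1, 5):
    # Cumulative n-gram weights for BLEU-1 through BLEU-4.
    weights = tuple(1.0 / n for _ in range(n)) + (0.0,) * (4 - n)
    score = corpus_bleu(references, hypotheses, weights=weights,
                        smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```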
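The GUI tool's audio and search features can likewise be sketched. The abstract does not name the libraries used, so gTTS for text-to-speech and a plain web search are assumptions made here for illustration.

```python
import webbrowser
from urllib.parse import quote_plus

from gtts import gTTS  # assumed text-to-speech backend; not named in the paper

caption = "a dog is running in the grass"  # caption produced by the model

# Audio step: synthesize the generated caption to an MP3 for playback.
gTTS(text=caption, lang="en").save("caption_audio.mp3")

# Search step: open a web search for content similar to the description.
webbrowser.open("https://www.google.com/search?q=" + quote_plus(caption))
```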
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Garg, K., Singh, V., Tiwary, U.S. (2022). Textual Description Generation for Visual Content Using Neural Networks. In: Kim, JH., Singh, M., Khan, J., Tiwary, U.S., Sur, M., Singh, D. (eds) Intelligent Human Computer Interaction. IHCI 2021. Lecture Notes in Computer Science, vol 13184. Springer, Cham. https://doi.org/10.1007/978-3-030-98404-5_2
DOI: https://doi.org/10.1007/978-3-030-98404-5_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-98403-8
Online ISBN: 978-3-030-98404-5
eBook Packages: Computer Science (R0)