Caption-Based Region Extraction in Images

Palash Agrawal, Rahul Yadav, Vikas Yadav, Kanjar De & Partha Pratim Roy

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1024)

Abstract

Image captioning and object detection are among the fastest-growing and most popular research areas in computer vision. Almost every emerging technology uses vision in some way, and with so many researchers working on object detection, many vision problems that once seemed intractable now seem close to solved. There has been less research, however, on identifying regions that associate actions with objects. Dense image captioning [8] is one such application: it localizes all the important regions in an image and describes each of them, much like ordinary image captioning repeated for every salient region. In this paper, we address the related problem of detecting the regions of an image that explain a given query caption. We use edge boxes to generate object proposals efficiently and filter them further using a score measure. The remaining proposals are then captioned using a pretrained Inception [19] model, and the caption of each region is checked for similarity with the query caption using skip-thought vectors [9]. We evaluate the framework quantitatively by computing the intersection over union (IoU) with the ground truth on the Visual Genome [10] dataset. Combining these techniques in sequence yields encouraging results.
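The last two stages described in the abstract (matching region captions to the query caption, then scoring the selected region against the ground truth) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `(x1, y1, x2, y2)` box format, and the use of cosine similarity between sentence vectors are assumptions made here for illustration, and the embedding vectors are taken as given rather than produced by a skip-thought encoder.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format (assumed)."""
    # Corners of the overlapping rectangle, if any.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def best_region(query_vec, regions):
    """Return the box whose caption vector is most similar to the query vector.

    `regions` is a list of (box, caption_vec) pairs; in a full pipeline the
    vectors would come from a sentence encoder, which is outside this sketch.
    """
    return max(regions, key=lambda r: cosine_similarity(query_vec, r[1]))[0]


if __name__ == "__main__":
    # Toy two-dimensional "embeddings", purely hypothetical.
    regions = [((0, 0, 1, 1), [0.0, 1.0]), ((1, 1, 3, 3), [1.0, 0.1])]
    chosen = best_region([1.0, 0.0], regions)
    print(chosen, iou(chosen, (0, 0, 2, 2)))
```

Selecting only the single most similar region keeps the example short; a threshold on the similarity score would be an equally plausible choice.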

References

1. Dollár, P., Zitnick, C.L.: Structured forests for fast edge detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1841–1848 (2013)
2. Dollár, P., Zitnick, C.L.: Fast edge detection using structured forests. IEEE Trans. Pattern Anal. Mach. Intell. 37(8), 1558–1570 (2015)
3. Erhan, D., Szegedy, C., Toshev, A., Anguelov, D.: Scalable object detection using deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2147–2154 (2014)
4. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448 (2015)
5. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)
6. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
7. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
8. Johnson, J., Karpathy, A., Fei-Fei, L.: DenseCap: fully convolutional localization networks for dense captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4565–4574 (2016)
9. Kiros, R., Zhu, Y., Salakhutdinov, R.R., Zemel, R., Urtasun, R., Torralba, A., Fidler, S.: Skip-thought vectors. In: Advances in Neural Information Processing Systems, pp. 3294–3302 (2015)
10. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 123(1), 32–73 (2017)
11. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
12. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
13. Pham, T.Q.: Non-maximum suppression using fewer than two comparisons per pixel. In: International Conference on Advanced Concepts for Intelligent Vision Systems, pp. 438–451. Springer (2010)
14. Pinheiro, P.O., Collobert, R., Dollár, P.: Learning to segment object candidates. In: Advances in Neural Information Processing Systems, pp. 1990–1998 (2015)
15. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017)
16. Van de Sande, K.E., Uijlings, J.R., Gevers, T., Smeulders, A.W.: Segmentation as selective search for object recognition. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp. 1879–1886. IEEE (2011)
17. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: OverFeat: integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229 (2013)
18. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
19. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)
20. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Trans. Pattern Anal. Mach. Intell. 39(4), 652–663 (2017)
21. Zitnick, C.L., Dollár, P.: Edge boxes: locating object proposals from edges. In: European Conference on Computer Vision, pp. 391–405. Springer (2014)

Author information

Authors and Affiliations

Indian Institute of Technology Roorkee, Roorkee, India
Palash Agrawal, Rahul Yadav, Vikas Yadav, Kanjar De & Partha Pratim Roy

Corresponding author

Correspondence to Kanjar De.

Editor information

Editors and Affiliations

Techno India University, Kolkata, India
Bidyut B. Chaudhuri
Division of Advanced Information Technology and Computer Science, Tokyo University of Agriculture and Technology, Koganei-shi, Tokyo, Japan
Masaki Nakagawa
Department of Computer Science, Indian Institute of Information Technology, Design and Manufacturing, Jabalpur, Madhya Pradesh, India
Pritee Khanna
Department of Mathematics, Indian Institute of Technology Roorkee, Roorkee, Uttarakhand, India
Sanjeev Kumar

About this paper

Cite this paper

Agrawal, P., Yadav, R., Yadav, V., De, K., Roy, P.P. (2020). Caption-Based Region Extraction in Images. In: Chaudhuri, B., Nakagawa, M., Khanna, P., Kumar, S. (eds) Proceedings of 3rd International Conference on Computer Vision and Image Processing. Advances in Intelligent Systems and Computing, vol 1024. Springer, Singapore. https://doi.org/10.1007/978-981-32-9291-8_3

DOI: https://doi.org/10.1007/978-981-32-9291-8_3
Published: 20 September 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-32-9290-1
Online ISBN: 978-981-32-9291-8
eBook Packages: Engineering, Engineering (R0)
