
Image Captioning in the Wild: How People Caption Images on Flickr

Published: 27 October 2017

Abstract

Automatic image captioning is a well-known problem in the field of artificial intelligence. To solve it effectively, it is also necessary to understand how people caption images naturally, that is, when they are not instructed by a set of rules telling them to caption in a certain way. This dimension of the problem is rarely discussed. To study it, we performed a crowdsourcing study on specific subsets of the Yahoo Flickr Creative Commons 100 Million dataset (YFCC100M), in which annotators evaluate captions with respect to subjectivity, visibility, appeal and intent. We use the resulting data to systematically characterize the variations in image captions that appear "in the wild". We publish our findings here along with the annotated dataset.
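
As a rough illustration of the data-selection step behind such a study, the Python sketch below filters YFCC100M metadata down to photos that carry a user-written caption (the Flickr description field) and attaches the four annotation dimensions as empty slots to be filled by annotators. The column indices and the yfcc100m_dataset.tsv filename are assumptions about the commonly distributed tab-separated metadata layout, not details taken from the paper.

import csv

# Assumed column positions in the YFCC100M metadata TSV (verify against your copy):
# 7 = description (the user-written caption), 22 = photo/video marker (0 = photo).
DESCRIPTION, MARKER = 7, 22

# Annotation dimensions from the crowdsourcing study described above.
DIMENSIONS = ("subjectivity", "visibility", "appeal", "intent")


def captioned_photos(path="yfcc100m_dataset.tsv"):
    """Yield (photo_id, caption) pairs for photos with a non-empty description."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) > MARKER and row[MARKER] == "0" and row[DESCRIPTION].strip():
                yield row[0], row[DESCRIPTION].strip()


if __name__ == "__main__":
    # Each selected (image, caption) pair would then be rated by annotators along
    # the four dimensions; here we only print a placeholder record per photo.
    for photo_id, caption in captioned_photos():
        print(photo_id, caption[:80], {d: None for d in DIMENSIONS})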




Published In

MUSA2 '17: Proceedings of the Workshop on Multimodal Understanding of Social, Affective and Subjective Attributes
October 2017
78 pages
ISBN:9781450355094
DOI:10.1145/3132515
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 October 2017


Author Tags

  1. flickr
  2. image captioning
  3. intent
  4. sentiment
  5. subjectivity
  6. yfcc100m

Qualifiers

  • Research-article

Conference

MM '17: ACM Multimedia Conference
October 27, 2017
Mountain View, California, USA


Cited By

  • (2024) Image Caption Generator using Deep Learning. International Journal of Advanced Research in Science, Communication and Technology, 540-545. DOI: 10.48175/IJARSCT-17881. Online publication date: 29-Apr-2024.
  • (2024) Exploring Progress in Text-to-Image Synthesis: An In-Depth Survey on the Evolution of Generative Adversarial Networks. IEEE Access, 12, 178401-178440. DOI: 10.1109/ACCESS.2024.3435541. Online publication date: 2024.
  • (2022) A Review of Multi-Modal Learning from the Text-Guided Visual Processing Viewpoint. Sensors, 22(18), 6816. DOI: 10.3390/s22186816. Online publication date: 8-Sep-2022.
  • (2021) Adversarial text-to-image synthesis: A review. Neural Networks, 144, 187-209. DOI: 10.1016/j.neunet.2021.07.019. Online publication date: Dec-2021.
  • (2021) Augmented Human and Human-Machine Co-evolution: Efficiency and Ethics. Reflections on Artificial Intelligence for Humanity, 203-227. DOI: 10.1007/978-3-030-69128-8_13. Online publication date: 7-Feb-2021.
  • (2019) Conditional GANs for Image Captioning with Sentiments. Artificial Neural Networks and Machine Learning - ICANN 2019: Text and Time Series, 300-312. DOI: 10.1007/978-3-030-30490-4_25. Online publication date: 9-Sep-2019.
  • (undefined) Visual Image Caption Generator Using Deep Learning. SSRN Electronic Journal. DOI: 10.2139/ssrn.3368837.
