
Image Captioning in the Wild: How People Caption Images on Flickr

Published: 27 October 2017

Abstract

Automatic image captioning is a well-known problem in the field of artificial intelligence. To solve it effectively, it is also necessary to understand how people caption images naturally, that is, when they are not instructed by a set of rules telling them to caption in a certain way. This dimension of the problem is rarely discussed. To study it, we performed a crowdsourcing study on specific subsets of the Yahoo Flickr Creative Commons 100 Million dataset (YFCC100M), in which annotators evaluate captions with respect to subjectivity, visibility, appeal and intent. We use the resulting data to systematically characterize the variations in image captions that appear "in the wild". We publish our findings here along with the annotated dataset.
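
As a rough illustration of the data-selection step behind such a study, the Python sketch below filters YFCC100M metadata down to photos that carry a user-written caption (the Flickr description field) and attaches the four annotation dimensions as empty slots to be filled by annotators. The column indices and the yfcc100m_dataset.tsv filename are assumptions about the commonly distributed tab-separated metadata layout, not details taken from the paper.

import csv

# Assumed column positions in the YFCC100M metadata TSV (verify against your copy):
# 7 = description (the user-written caption), 22 = photo/video marker (0 = photo).
DESCRIPTION, MARKER = 7, 22

# Annotation dimensions from the crowdsourcing study described above.
DIMENSIONS = ("subjectivity", "visibility", "appeal", "intent")


def captioned_photos(path="yfcc100m_dataset.tsv"):
    """Yield (photo_id, caption) pairs for photos with a non-empty description."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) > MARKER and row[MARKER] == "0" and row[DESCRIPTION].strip():
                yield row[0], row[DESCRIPTION].strip()


if __name__ == "__main__":
    # Each selected (image, caption) pair would then be rated by annotators along
    # the four dimensions; here we only print a placeholder record per photo.
    for photo_id, caption in captioned_photos():
        print(photo_id, caption[:80], {d: None for d in DIMENSIONS})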




Published In

MUSA2 '17: Proceedings of the Workshop on Multimodal Understanding of Social, Affective and Subjective Attributes
October 2017
78 pages
ISBN:9781450355094
DOI:10.1145/3132515
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 October 2017


Author Tags

  1. flickr
  2. image captioning
  3. intent
  4. sentiment
  5. subjectivity
  6. yfcc100m

Qualifiers

  • Research-article

Conference

MM '17: ACM Multimedia Conference
October 27, 2017
Mountain View, California, USA


Cited By

  • (2024) Image Caption Generator using Deep Learning. International Journal of Advanced Research in Science, Communication and Technology, 540-545. DOI: 10.48175/IJARSCT-17881. Online publication date: 29-Apr-2024.
  • (2024) Exploring Progress in Text-to-Image Synthesis: An In-Depth Survey on the Evolution of Generative Adversarial Networks. IEEE Access, 12, 178401-178440. DOI: 10.1109/ACCESS.2024.3435541. Online publication date: 2024.
  • (2022) A Review of Multi-Modal Learning from the Text-Guided Visual Processing Viewpoint. Sensors, 22(18), 6816. DOI: 10.3390/s22186816. Online publication date: 8-Sep-2022.
  • (2021) Adversarial text-to-image synthesis: A review. Neural Networks, 144, 187-209. DOI: 10.1016/j.neunet.2021.07.019. Online publication date: Dec-2021.
  • (2021) Augmented Human and Human-Machine Co-evolution: Efficiency and Ethics. Reflections on Artificial Intelligence for Humanity, 203-227. DOI: 10.1007/978-3-030-69128-8_13. Online publication date: 7-Feb-2021.
  • (2019) Conditional GANs for Image Captioning with Sentiments. Artificial Neural Networks and Machine Learning - ICANN 2019: Text and Time Series, 300-312. DOI: 10.1007/978-3-030-30490-4_25. Online publication date: 9-Sep-2019.
  • (undefined) Visual Image Caption Generator Using Deep Learning. SSRN Electronic Journal. DOI: 10.2139/ssrn.3368837.
