Abstract
We propose and experimentally investigate the usefulness of several features for selecting image content (objects) suitable for image captioning. The approach taken explores three broad categories of features, namely geometric, conceptual, and visual. Experiments suggest that widely known geometric ‘rules’ in art aesthetics or photography (such as the golden ratio or the rule-of-thirds) and facts about the human visual system (such as its horizontal viewing angle being wider than its vertical one) provide no useful information for the task. Human captioners seem to prefer large, elongated (but not in the golden ratio) objects, positioned near the image center, irrespective of orientation. Conceptually, the preferred objects are either too specific or too general, and animate things are almost always mentioned; furthermore, some evidence is found for selecting diverse objects in order to achieve maximal image coverage in captions. Visual object features such as saliency, depth, edges, entropy, and contrast are all found to provide useful information. Beyond evaluating features in isolation, we investigate how well they combine by performing feature and feature-category ablation studies, leading to an effective set of features which can prove useful for operational systems. Moreover, we propose alternative approaches to feature engineering and evaluation that address the drawbacks of the evaluation methodology proposed in past literature.
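To make the geometric feature category more concrete, the sketch below computes a handful of simple bounding-box features of the kind discussed above: relative object size, elongation, distance from the image center, and distance to the nearest rule-of-thirds point. This is an illustrative Python sketch only; the function name, the exact feature definitions, and the normalizations are assumptions for demonstration, not the implementation evaluated in the paper.

```python
import math

def geometric_features(bbox, image_w, image_h):
    """Illustrative geometric features for an object bounding box.

    bbox = (x, y, w, h) in pixels. The definitions below are simplified
    assumptions for demonstration, not the paper's feature set.
    """
    x, y, w, h = bbox
    # Relative area: fraction of the image occupied by the object.
    rel_area = (w * h) / (image_w * image_h)
    # Elongation: ratio of the longer to the shorter side (1.0 = square).
    elongation = max(w, h) / max(min(w, h), 1)
    # Normalized distance of the object center from the image center.
    cx, cy = x + w / 2, y + h / 2
    dx = (cx - image_w / 2) / image_w
    dy = (cy - image_h / 2) / image_h
    center_dist = math.hypot(dx, dy)
    # Normalized distance to the nearest rule-of-thirds power point.
    thirds = [(image_w * i / 3, image_h * j / 3) for i in (1, 2) for j in (1, 2)]
    thirds_dist = min(math.hypot((cx - px) / image_w, (cy - py) / image_h)
                      for px, py in thirds)
    return {"rel_area": rel_area, "elongation": elongation,
            "center_dist": center_dist, "thirds_dist": thirds_dist}

# Example: a 200x300 object centered in a 640x480 image.
print(geometric_features((220, 90, 200, 300), 640, 480))
```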
Notes
This strategy is not arbitrary and has a parallel in information retrieval evaluation. For example, evaluating a text document for its relevance to an information need is notoriously subjective, with only modest agreement across human evaluators, even when the need is expressed by more than just a query, i.e., with a title, description, and narrative (called a ‘topic’ in the jargon of NIST’s annual Text REtrieval Conference, or TREC). In this respect, TREC evaluations have typically used majority voting: for example, if two out of three judges say a document is relevant, then it is taken as relevant.
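As a minimal sketch of the majority-voting rule described above (assuming one binary relevance label per judge; the function name is hypothetical):

```python
from collections import Counter

def majority_vote(labels):
    """Return True if most judges labeled the item relevant.

    labels: iterable of booleans (True = relevant). Ties count as
    not relevant. A simplified illustration of TREC-style aggregation.
    """
    counts = Counter(bool(label) for label in labels)
    return counts[True] > counts[False]

# Two of three judges say relevant -> the document counts as relevant.
print(majority_vote([True, True, False]))  # True
```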
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Barlas, G., Veinidis, C. & Arampatzis, A. What we see in a photograph: content selection for image captioning. Vis Comput 37, 1309–1326 (2021). https://doi.org/10.1007/s00371-020-01867-9