
LCEval: Learned Composite Metric for Caption Evaluation

Published: 01 October 2019

Abstract

Automatic evaluation metrics are of fundamental importance in the development and fine-grained analysis of captioning systems. While current evaluation metrics tend to achieve an acceptable correlation with human judgements at the system level, they fail to do so at the caption level. In this work, we propose a neural-network-based learned metric to improve caption-level caption evaluation. To gain deeper insight into the parameters that affect a learned metric’s performance, this paper investigates the relationship between different linguistic features and the caption-level correlation of the learned metrics. We also compare metrics trained with different training examples to measure the variations in their evaluation. Moreover, we perform a robustness analysis, which highlights the sensitivity of learned and handcrafted metrics to various sentence perturbations. Our empirical analysis shows that our proposed metric not only outperforms the existing metrics in terms of caption-level correlation but also shows a strong system-level correlation against human assessments.
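The abstract outlines the general idea of a learned composite metric: feed a set of quality features for a candidate caption into a neural network, train it to predict caption quality, and then judge the metric by its caption-level correlation with human judgements. The sketch below is purely illustrative and is not the authors' LCEval architecture; it assumes the composite metric combines the scores of existing handcrafted metrics (hypothetically BLEU, METEOR, CIDEr and SPICE) through a small feed-forward network, and it measures caption-level agreement with Kendall's tau. All class and function names are hypothetical.

```python
# Illustrative sketch only -- NOT the authors' LCEval architecture.
# Assumption: a "learned composite metric" takes handcrafted metric scores
# (e.g. BLEU, METEOR, CIDEr, SPICE) as input features, maps them to a single
# caption-quality score with a small feed-forward network, and is assessed by
# its caption-level Kendall's tau against human judgements.
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class CompositeMetric:
    """One-hidden-layer network over handcrafted metric scores (untrained here)."""

    def __init__(self, n_features, n_hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_features, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, 1))
        self.b2 = np.zeros(1)

    def score(self, feats):
        """feats: (n_captions, n_features) array of handcrafted metric scores."""
        h = np.tanh(feats @ self.W1 + self.b1)
        return sigmoid(h @ self.W2 + self.b2).ravel()  # quality score in [0, 1]


def kendall_tau(pred, human):
    """Caption-level correlation: Kendall's tau between metric scores and
    human judgements over individual captions."""
    n = len(pred)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = np.sign(pred[i] - pred[j]) * np.sign(human[i] - human[j])
            concordant += int(s > 0)
            discordant += int(s < 0)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Toy usage: each row holds hypothetical BLEU, METEOR, CIDEr, SPICE scores for
# one candidate caption; `human` holds the corresponding human quality ratings.
feats = np.array([[0.31, 0.24, 0.55, 0.18],
                  [0.72, 0.41, 1.10, 0.35],
                  [0.10, 0.15, 0.20, 0.05]])
human = np.array([0.4, 0.9, 0.1])
metric = CompositeMetric(n_features=4)
print(kendall_tau(metric.score(feats), human))
```

In practice the network would first be trained on captions paired with human quality judgements (training is omitted here), so the tau value printed with the random weights above is meaningless and only demonstrates the evaluation pipeline; the same correlation can be computed at the system level by averaging scores per captioning system.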

Cited By

  • (2024) Diagram Perception Networks for Textbook Question Answering via Joint Optimization. International Journal of Computer Vision, 132(5), 1578–1591. https://doi.org/10.1007/s11263-023-01954-z (online publication date: 1 May 2024)



Published In

International Journal of Computer Vision, Volume 127, Issue 10
Oct 2019
226 pages

Publisher

Kluwer Academic Publishers

United States

Author Tags

  1. Image captioning
  2. Automatic evaluation metric
  3. Neural networks
  4. Learned metrics
  5. Correlation
  6. Accuracy
  7. Robustness

Qualifiers

  • Research-article
