
Image captioning for cultural artworks: a case study on ceramics

Published: 23 September 2023

Abstract

When viewing ancient artworks, people try to build connections with them so as to ‘read’ the correct messages from the past. A proper descriptive caption is essential for viewers to attain a shared understanding and cognitive appreciation. Recent advances in tailoring deep learning for image analysis predominantly focus on generating captions for natural images. These techniques, however, are ill-suited for interpreting ancient artworks, which exhibit distinctive appearances, varied design functions, and, more importantly, implicit cultural metaphors that can hardly be summarized in a short caption or sentence. This work presents the design and implementation of a novel framework, termed ARTalk, for comprehensive image captioning of ancient artworks, with ceramics as the running case. First, we launch an exploratory study on understanding ancient-artwork captions, elaborate 15 factors via semi-structured discussions with experts, and form a dedicated caption template through statistical analysis of factor importance. Second, we build a dataset (CArt15K) with factor-granularity annotations on the visuals and texts of ceramics. Third, we jointly fine-tune multiple deep networks for automatic factor extraction and construct a knowledge graph for metaphor inference. We train the networks on CArt15K, evaluate performance against baselines, and conduct a qualitative analysis of practical caption generation. We have also implemented a prototype of ARTalk for interactively assisting experts in caption generation. We will release the CArt15K dataset for further research.
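The abstract outlines a pipeline of automatic factor extraction followed by template-based caption assembly, but does not detail either step. As a purely illustrative sketch, the slot-filling step could look like the following; the factor names and template here are hypothetical and are not the paper's actual 15 factors or its template:

```python
# Hypothetical sketch of template-based caption assembly from extracted
# factors. Factor names and the template are illustrative only; they are
# not the 15 factors or the caption template defined in the paper.

def assemble_caption(factors: dict) -> str:
    """Fill a fixed caption template, skipping factors that were not extracted."""
    parts = []
    if "dynasty" in factors and "shape" in factors:
        parts.append(f"A {factors['dynasty']} dynasty {factors['shape']}")
    if "glaze" in factors:
        parts.append(f"finished with a {factors['glaze']}")
    if "motif" in factors:
        parts.append(f"decorated with a {factors['motif']} motif")
    sentence = ", ".join(parts) + "."
    if "metaphor" in factors and "motif" in factors:
        # A metaphor inferred (e.g., from a knowledge graph) is appended
        # as a second, interpretive sentence.
        sentence += (
            f" The {factors['motif']} traditionally symbolizes "
            f"{factors['metaphor']}."
        )
    return sentence

caption = assemble_caption({
    "dynasty": "Ming",
    "shape": "vase",
    "glaze": "cobalt-blue underglaze",
    "motif": "lotus",
    "metaphor": "purity",
})
print(caption)
```

Any missing factor simply drops out of the caption, which mirrors the practical situation where not all of the factors can be extracted from a given ceramic image.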

Published In

Multimedia Systems, Volume 29, Issue 6
Dec 2023
800 pages

Publisher

Springer-Verlag, Berlin, Heidelberg

Publication History

Published: 23 September 2023
Accepted: 29 August 2023
Received: 06 March 2023

Author Tags

1. Image captioning
2. Artwork analysis
3. Feature extraction
4. Cultural metaphor

Qualifiers

• Research-article
