
Toward Attribute-Controlled Fashion Image Captioning

Published: 23 September 2024

Abstract

Fashion image captioning is a critical task in the fashion industry that aims to automatically generate product descriptions for fashion items. However, once deployed, existing fashion image captioning models predict a fixed caption for a given fashion item, which cannot cater to users' unique preferences. We explore a controllable approach to fashion image captioning that allows users to specify a few semantic attributes to guide caption generation. Our approach uses semantic attributes as a control signal, giving users the ability to specify particular fashion attributes (e.g., stitch, knit, sleeve) and styles (e.g., cool, classic, fresh) that they want the model to incorporate when generating captions. This level of customization produces more personalized and targeted captions that suit individual preferences. To evaluate the effectiveness of the proposed approach, we clean, filter, and assemble a new fashion image caption dataset, FACAD170K, from the existing FACAD dataset. This dataset facilitates learning and enables us to investigate the effectiveness of our approach. Our results demonstrate that the proposed approach outperforms existing fashion image captioning models as well as conventional captioning methods. In addition, we validate the proposed method on the MSCOCO and Flickr30K captioning datasets, where it achieves competitive performance.
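The abstract describes conditioning caption generation on user-specified semantic attributes. As a rough illustration only, and not the authors' published architecture, the following minimal PyTorch sketch shows one plausible way such a control signal could be injected: learned attribute embeddings are concatenated with projected image features so that a transformer decoder can cross-attend to both when predicting each word. All class names, feature dimensions, and the fusion strategy are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

class AttributeControlledCaptioner(nn.Module):
    """Illustrative sketch only: NOT the paper's published model.

    Shows one plausible way to inject user-chosen attributes (e.g.,
    "knit", "classic") as a control signal alongside image features.
    """

    def __init__(self, vocab_size, num_attributes, d_model=512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, d_model)
        # One learned embedding per controllable attribute/style.
        self.attr_embed = nn.Embedding(num_attributes, d_model)
        # Project precomputed image region features (2048-d assumed).
        self.image_proj = nn.Linear(2048, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, image_feats, attr_ids, caption_ids):
        # The decoder "memory" holds both image regions and the chosen
        # attributes, so cross-attention can draw on either source.
        memory = torch.cat(
            [self.image_proj(image_feats), self.attr_embed(attr_ids)], dim=1
        )
        tgt = self.word_embed(caption_ids)
        t = tgt.size(1)
        # Causal mask so each position only sees earlier words.
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.out(hidden)  # (batch, seq_len, vocab) logits

# Toy usage: 2 images with 36 region features each, 3 attributes chosen.
model = AttributeControlledCaptioner(vocab_size=10000, num_attributes=990)
logits = model(torch.randn(2, 36, 2048),
               torch.randint(0, 990, (2, 3)),
               torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```

Concatenating attribute embeddings into the decoder memory is just one fusion choice; gated attention or prefix-style conditioning would be equally valid ways to realize the same controllable behavior.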




Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 9 (September 2024), 780 pages
EISSN: 1551-6865
DOI: 10.1145/3613681
Editor: Abdulmotaleb El Saddik

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 September 2024
Online AM: 05 June 2024
Accepted: 25 May 2024
Revised: 04 February 2024
Received: 27 June 2023
Published in TOMM Volume 20, Issue 9


Author Tags

  1. Fashion
  2. image captioning
  3. controllable
  4. semantic understanding
  5. dataset

Qualifiers

  • Research-article


