DOI: 10.1145/3652583.3658068

Dynamic Soft Labeling for Visual Semantic Embedding

Published: 07 June 2024

Abstract

Visual Semantic Embedding (VSE) is a prominent approach to image-text retrieval that aims to learn a deep embedding space aligning visual data with semantic text labels. However, current VSE methods oversimplify the retrieval task, treating it as a binary classification problem constrained by a triplet loss. This ignores the semantic correlation between mismatched sample pairs and fails to capture the similarity gradient among samples. Moreover, hard constraints on negative samples with high semantic relevance can harm the model's representational capability. To address these limitations, we propose a novel training strategy that introduces dynamic soft labels without requiring additional annotations. These labels capture the correlation between positive and negative sample pairs and guide feature representation learning through the Soft Negative Alignment Loss (SNAL), which fully accounts for the influence of similar negative samples and enhances the representation of cross-modal data. In addition, we propose the Stepwise Negative Decoupling Loss (SNDL) to enlarge the distance between positive and negative samples: negative samples are stepwise decoupled and adaptively distanced according to their semantic relevance to the anchor, yielding a wider distribution of sample features in the common space. Experiments on the Flickr30K and MS-COCO datasets validate the effectiveness of our dynamic soft labeling (DSL) method, demonstrating the importance of modeling the complex relationships between sample pairs and the limitations of rigid negative-sample categorization based on subjective annotations.
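As a rough illustration of the idea described above, the PyTorch sketch below shows one way dynamic soft labels could modulate a triplet-style hinge loss: each in-batch negative's semantic relevance to the anchor is estimated from detached similarity scores and used to shrink the margin for semantically close negatives instead of repelling all negatives equally. The function name, the relevance estimate, and the margin schedule are illustrative assumptions, not the paper's exact SNAL/SNDL formulation.

```python
import torch
import torch.nn.functional as F

def soft_label_triplet_loss(img_emb, txt_emb, base_margin=0.2):
    """Hypothetical soft-label triplet loss (not the paper's exact loss).

    img_emb, txt_emb: (B, D) embeddings; row i of each tensor forms a
    matched image-text pair, and all other rows are in-batch negatives.
    """
    # L2-normalize so dot products are cosine similarities.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    sims = img_emb @ txt_emb.t()      # (B, B) similarity matrix
    pos = sims.diag().view(-1, 1)     # similarity of each matched pair

    # Dynamic soft label: estimate each negative's relevance to the
    # anchor from its detached similarity, mapped into [0, 1].
    with torch.no_grad():
        relevance = sims.clamp(min=0.0)
        relevance.fill_diagonal_(0.0)

    # Highly relevant negatives receive a smaller margin, so they are
    # not pushed as far from the anchor as clearly mismatched ones.
    margin = base_margin * (1.0 - relevance)

    cost = (margin + sims - pos).clamp(min=0.0)   # hinge over negatives
    eye = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    return cost.masked_fill(eye, 0.0).sum() / sims.size(0)

# Toy usage with random embeddings (batch of 128, dimension 1024).
loss = soft_label_triplet_loss(torch.randn(128, 1024), torch.randn(128, 1024))
```

Detaching the relevance scores is a deliberate choice in this sketch: gradients flow only through the hinge term, so the soft labels reweight how strongly each negative is repelled without letting the labeling mechanism itself collapse the embedding space.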




Published In

ICMR '24: Proceedings of the 2024 International Conference on Multimedia Retrieval
May 2024, 1379 pages
ISBN: 9798400706196
DOI: 10.1145/3652583


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. image-text retrieval
  2. similar negative samples
  3. visual semantic embedding

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China

Conference

ICMR '24

Acceptance Rates

Overall Acceptance Rate: 254 of 830 submissions, 31%

