
Set of Diverse Queries With Uncertainty Regularization for Composed Image Retrieval

Published: 01 October 2024

Abstract

Composed image retrieval aims to retrieve a target image by jointly understanding a composed query consisting of a reference image and complementary modification text. The goal is to find a shared latent space in which the representation of the composed inputs is close to that of the desired target image. Most previous methods capture a one-to-one correspondence between the composed inputs and the target image, encoding each into a single point in the feature space. However, one-to-one correspondence cannot effectively handle this task because of the inherent ambiguity arising from diverse semantic meanings and data uncertainty. Specifically, the composed inputs and the target image each carry multiple semantic meanings that affect the retrieval results. Moreover, given the composed inputs (resp. the target image), multiple target images (resp. composed inputs) can be equally valid. In this paper, we propose a novel method termed Set of Diverse Queries with Uncertainty Regularization (SDQUR) to address this inherent ambiguity. First, we use diverse queries to adaptively aggregate the composed inputs and the target image into multiple deterministic embeddings that capture the different semantic meanings in a triplet affecting the retrieval process, thereby exploiting the deterministic many-to-many correspondence within each triplet through these set-based queries. Second, we introduce an uncertainty regularization module that encodes the composed inputs and the target image into Gaussian distributions, from which multiple potential positive candidates are sampled to establish probabilistic many-to-many correspondence. By combining the complementary deterministic and probabilistic many-to-many correspondences, we achieve consistent improvements on the standard FashionIQ, CIRR, and Shoes benchmarks, surpassing state-of-the-art methods by a large margin.
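
The abstract describes the two components only at a high level. As a rough illustration of the ideas (set-based diverse queries and Gaussian uncertainty modeling with sampling), the following PyTorch-style sketch is a hypothetical reconstruction, not the authors' implementation; the module names, feature dimension, number of queries, and number of samples are all assumptions made for the example.

# Hypothetical sketch of the two ideas in the abstract (not the authors' code):
# (1) K learnable queries cross-attend to the fused reference-image/text tokens,
#     producing K deterministic embeddings for different semantic facets;
# (2) an uncertainty head models the pooled embedding as a Gaussian and draws
#     candidate embeddings via the reparameterization trick.
import torch
import torch.nn as nn

class DiverseQuerySet(nn.Module):
    def __init__(self, dim: int = 512, num_queries: int = 4):
        super().__init__()
        # K learnable queries, each intended to capture a different semantic meaning
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, L, D) fused features of the reference image and modification text
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)  # (B, K, D)
        out, _ = self.attn(q, tokens, tokens)                          # (B, K, D)
        return out                                                     # K deterministic embeddings

class UncertaintyHead(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.mu = nn.Linear(dim, dim)
        self.log_var = nn.Linear(dim, dim)

    def forward(self, feat: torch.Tensor, num_samples: int = 5):
        # feat: (B, D) pooled embedding; returns Gaussian parameters and sampled candidates
        mu, log_var = self.mu(feat), self.log_var(feat)
        std = torch.exp(0.5 * log_var)
        eps = torch.randn(num_samples, *feat.shape, device=feat.device)
        samples = mu + eps * std          # (S, B, D) potential positive candidates
        return mu, log_var, samples

# Toy usage with random features standing in for encoder outputs
tokens = torch.randn(2, 16, 512)
set_embeds = DiverseQuerySet()(tokens)                      # (2, 4, 512)
mu, log_var, samples = UncertaintyHead()(set_embeds.mean(1))
print(set_embeds.shape, samples.shape)

In a full system, the set embeddings and the sampled candidates would both be matched against the target-image side with a contrastive objective, which is the "complementary deterministic and probabilistic many-to-many correspondence" the abstract refers to.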



Published In

IEEE Transactions on Circuits and Systems for Video Technology  Volume 34, Issue 10_Part_2
Oct. 2024
761 pages

Publisher

IEEE Press

Publication History

Published: 01 October 2024

Qualifiers

  • Research-article
