
Semantic-alignment transformer and adversary hashing for cross-modal retrieval

Published in: Applied Intelligence (2024)

Abstract

Deep Cross-Modal Hashing (DCMH) has attracted significant attention in cross-modal retrieval because of its high computational efficiency and low storage cost. However, existing DCMH methods still face several limitations: (1) they neglect the correlation between labels, while label features are highly sparse; (2) they lack fine-grained semantic alignment; and (3) they fail to effectively address data imbalance. To tackle these issues, this paper introduces a framework named Semantic-Alignment Transformer and Adversary Hashing for Cross-modal Retrieval (SATAH). To the best of our knowledge, this is the first semantic-alignment transformer algorithm of its kind. Specifically, the paper first designs a label learning network that uses a purpose-built transformer module to extract label information, which in turn guides adversarial learning and hash-function learning. It then constructs a Balanced Conditional Generative Adversarial Network (BCGAN), marking the first instance of adversarial training guided by label information. Furthermore, a Weighted Semi-Hard Cosine Triplet Constraint is proposed to better preserve high-ranking similarity relationships among all items. Finally, to exploit the correlation between labels, a semantic-alignment constraint handles label correlation at a fine-grained level, capturing global similarity more effectively. Extensive experiments are conducted on multiple representative cross-modal datasets. With 64-bit hash codes, SATAH achieves average mAP values of 84.75%, 68.87%, and 68.73% on the MIR Flickr, NUS-WIDE, and MS COCO datasets, respectively, outperforming state-of-the-art methods. The code is available at https://github.com/Daydaylight/SATAH.
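The exact formulation of the Weighted Semi-Hard Cosine Triplet Constraint is given in the body of the paper; as a rough illustration of the idea summarized above, the PyTorch sketch below mines semi-hard negatives by cosine similarity over continuous hash codes and weights each triplet by label overlap. The function name, margin, and weighting rule are assumptions made for illustration, not the authors' implementation.

# Illustrative sketch only (not the authors' implementation): a cosine
# triplet loss with semi-hard negative mining over continuous hash codes,
# weighted by label overlap. Margin, weighting rule, and names are assumptions.
import torch
import torch.nn.functional as F

def weighted_semi_hard_cosine_triplet(img_codes, txt_codes, labels, margin=0.2):
    # img_codes, txt_codes: (N, K) continuous hash codes from the image/text networks
    # labels: (N, C) multi-hot label matrix (float); items sharing a label are positives
    img = F.normalize(img_codes, dim=1)
    txt = F.normalize(txt_codes, dim=1)
    cos = img @ txt.t()                       # (N, N) cross-modal cosine similarities
    pos_mask = labels @ labels.t() > 0        # True where two items share at least one label
    losses = []
    for i in range(img.size(0)):
        pos_idx = pos_mask[i].nonzero(as_tuple=True)[0]
        neg_idx = (~pos_mask[i]).nonzero(as_tuple=True)[0]
        if len(pos_idx) == 0 or len(neg_idx) == 0:
            continue
        for p in pos_idx:
            # semi-hard negatives: less similar than the positive but within the margin
            band = neg_idx[(cos[i, neg_idx] < cos[i, p]) &
                           (cos[i, neg_idx] > cos[i, p] - margin)]
            cand = band if len(band) > 0 else neg_idx
            n = cand[cos[i, cand].argmax()]   # hardest candidate negative
            # weight by label overlap with the positive (an assumed weighting rule)
            w = (labels[i] * labels[p]).sum() / labels[i].sum().clamp(min=1)
            losses.append(w * F.relu(cos[i, n] - cos[i, p] + margin))
    if not losses:
        return img_codes.new_zeros(())
    return torch.stack(losses).mean()

In a full training loop, a term of this kind would be combined with the adversarial and semantic-alignment losses described in the paper; restricting mining to the semi-hard band keeps optimization focused on informative negatives rather than on the easiest or already-violating ones.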



Data Availability

The source code and associated data for the experiments conducted in this study are publicly available in the following GitHub repository: https://github.com/Daydaylight/SATAH. The repository serves as a comprehensive resource for replicating the experiments, validating the findings, and further exploring the methods employed in this work.


Author information


Corresponding author

Correspondence to Meng Wang.

Ethics declarations

Conflicts of interest

The authors declare no conflicts of interest that could potentially influence the outcome or interpretation of the research reported in this manuscript.

Competing Interests

The authors declare no competing interests related to the publication of this manuscript.

Informed consent

Not applicable; this study did not involve human participants.

Research involving Human Participants and/or Animals

This study did not involve human participants or animals.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Sun, Y., Wang, M. & Ma, Y. Semantic-alignment transformer and adversary hashing for cross-modal retrieval. Appl Intell 54, 7581–7602 (2024). https://doi.org/10.1007/s10489-024-05501-2

