Abstract
Deep Cross-Modal Hashing (DCMH) has attracted considerable attention in cross-modal retrieval owing to its high computational efficiency and low storage cost. However, existing DCMH methods still face several limitations: (1) they neglect the correlation between labels, whose features are highly sparse; (2) they lack fine-grained semantic alignment; and (3) they fail to handle data imbalance effectively. To address these issues, this paper introduces a framework named Semantic-Alignment Transformer and Adversary Hashing for Cross-modal Retrieval (SATAH). To the best of our knowledge, this is the first attempt to propose a Semantic-Alignment Transformer algorithm. Specifically, we first design a label learning network that uses a purpose-built transformer module to extract label information, which in turn guides adversarial learning and hash-function learning. We then construct a Balanced Conditional Generative Adversarial Network (BCGAN), marking the first instance of adversarial training guided by label information. Furthermore, a Weighted Semi-Hard Cosine Triplet Constraint is proposed to better preserve high-ranking similarity relationships among all items. Finally, taking label correlation into account, a semantic-alignment constraint handles label correlation at a fine-grained level, capturing similarity more effectively on a global scale. Extensive experiments are conducted on multiple representative cross-modal datasets. With 64-bit hash codes, SATAH achieves average mAP values of 84.75%, 68.87%, and 68.73% on the MIR Flickr, NUS-WIDE, and MS COCO datasets, respectively, outperforming state-of-the-art methods. The code is available at https://github.com/Daydaylight/SATAH.
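The exact formulation of the Weighted Semi-Hard Cosine Triplet Constraint is given in the full paper; purely as an illustration, the PyTorch sketch below shows one plausible way to mine semi-hard negatives under cosine distance and to weight each triplet by how strongly it violates the margin. The function name, the weighting scheme, and the multi-label similarity rule (two samples count as similar if they share any label) are assumptions made for this sketch, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def semi_hard_cosine_triplet_loss(embeddings, labels, margin=0.2):
    """Illustrative weighted semi-hard triplet loss on cosine distance.

    embeddings: (B, d) float tensor of modality features.
    labels:     (B, C) multi-hot label matrix; samples sharing any label
                are treated as similar (assumption for this sketch).
    """
    emb = F.normalize(embeddings, dim=1)            # unit-norm rows
    cos_dist = 1.0 - emb @ emb.t()                  # pairwise cosine distance
    same = (labels.float() @ labels.float().t()) > 0
    eye = torch.eye(len(labels), device=labels.device).bool()
    pos_mask = same & ~eye                          # positives, excluding self
    neg_mask = ~same                                # negatives

    losses = []
    for a in range(len(labels)):
        for p in pos_mask[a].nonzero(as_tuple=True)[0]:
            d_ap = cos_dist[a, p]
            d_an = cos_dist[a][neg_mask[a]]
            # semi-hard: farther than the positive, but still inside the margin
            semi = d_an[(d_an > d_ap) & (d_an < d_ap + margin)]
            if semi.numel() == 0:
                continue
            viol = d_ap - semi + margin             # residual margin violation
            w = viol / viol.sum()                   # weight harder triplets more
            losses.append((w * viol).sum())

    if not losses:
        return embeddings.new_zeros(())
    return torch.stack(losses).mean()
```

In use, such a loss would be computed on the continuous hash-layer outputs of each modality within a mini-batch and added to the other objectives (adversarial and semantic-alignment terms) before back-propagation.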
Data Availability
The source code and associated data for the experiments conducted in this study are publicly available at https://github.com/Daydaylight/SATAH. The repository provides the resources needed to replicate the experiments, validate the findings, and further explore the methods employed in the study.
Ethics declarations
Conflicts of interest
The authors declare no conflicts of interest that could potentially influence the outcome or interpretation of the research reported in this manuscript.
Competing Interests
The authors declare no competing interests related to the publication of this manuscript.
Informed consent
Not applicable, as this study did not involve human participants.
Research involving Human Participants and/or Animals
This study did not involve human participants or animals.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sun, Y., Wang, M. & Ma, Y. Semantic-alignment transformer and adversary hashing for cross-modal retrieval. Appl Intell 54, 7581–7602 (2024). https://doi.org/10.1007/s10489-024-05501-2