Abstract
Deep Cross-Modal Hashing (DCMH) has attracted considerable attention in cross-modal retrieval owing to its high computational efficiency and low storage cost. However, existing DCMH methods still face several limitations: (1) they neglect the correlation between labels, whose features are highly sparse; (2) they lack fine-grained semantic alignment; and (3) they fail to handle data imbalance effectively. To address these issues, this paper introduces a framework named Semantic-Alignment Transformer and Adversary Hashing for Cross-modal Retrieval (SATAH). To the best of our knowledge, this is the first attempt to propose a Semantic-Alignment Transformer algorithm. Specifically, we first design a label learning network that uses a purpose-built transformer module to extract label information, which in turn guides adversarial learning and hash-function learning. We then construct a Balanced Conditional Generative Adversarial Network (BCGAN), marking the first instance of adversarial training guided by label information. Furthermore, a Weighted Semi-Hard Cosine Triplet Constraint is proposed to better preserve high-ranking similarity relationships among all items. Finally, taking label correlation into account, a semantic-alignment constraint handles label correlation at a fine-grained level, capturing similarity more effectively on a global scale. Extensive experiments are conducted on multiple representative cross-modal datasets. With 64-bit hash codes, SATAH achieves average mAP values of 84.75%, 68.87%, and 68.73% on the MIR Flickr, NUS-WIDE, and MS COCO datasets, respectively, outperforming state-of-the-art methods. The code is available at https://github.com/Daydaylight/SATAH.
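The exact formulation of the Weighted Semi-Hard Cosine Triplet Constraint is given in the full paper; purely as an illustration, the PyTorch sketch below shows one plausible way to mine semi-hard negatives under cosine distance and to weight each triplet by how strongly it violates the margin. The function name, the weighting scheme, and the multi-label similarity rule (two samples count as similar if they share any label) are assumptions made for this sketch, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def semi_hard_cosine_triplet_loss(embeddings, labels, margin=0.2):
    """Illustrative weighted semi-hard triplet loss on cosine distance.

    embeddings: (B, d) float tensor of modality features.
    labels:     (B, C) multi-hot label matrix; samples sharing any label
                are treated as similar (assumption for this sketch).
    """
    emb = F.normalize(embeddings, dim=1)            # unit-norm rows
    cos_dist = 1.0 - emb @ emb.t()                  # pairwise cosine distance
    same = (labels.float() @ labels.float().t()) > 0
    eye = torch.eye(len(labels), device=labels.device).bool()
    pos_mask = same & ~eye                          # positives, excluding self
    neg_mask = ~same                                # negatives

    losses = []
    for a in range(len(labels)):
        for p in pos_mask[a].nonzero(as_tuple=True)[0]:
            d_ap = cos_dist[a, p]
            d_an = cos_dist[a][neg_mask[a]]
            # semi-hard: farther than the positive, but still inside the margin
            semi = d_an[(d_an > d_ap) & (d_an < d_ap + margin)]
            if semi.numel() == 0:
                continue
            viol = d_ap - semi + margin             # residual margin violation
            w = viol / viol.sum()                   # weight harder triplets more
            losses.append((w * viol).sum())

    if not losses:
        return embeddings.new_zeros(())
    return torch.stack(losses).mean()
```

In use, such a loss would be computed on the continuous hash-layer outputs of each modality within a mini-batch and added to the other objectives (adversarial and semantic-alignment terms) before back-propagation.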
Data Availability
The source code and associated data for the experiments conducted in this study are publicly available at https://github.com/Daydaylight/SATAH. The repository provides the resources needed to replicate the experiments, validate the findings, and further explore the methods employed in the study.
Ethics declarations
Conflicts of interest
The authors declare no conflicts of interest that could potentially influence the outcome or interpretation of the research reported in this manuscript.
Competing Interests
The authors declare no competing interests related to the publication of this manuscript.
Informed consent
Not applicable, as this study did not involve human participants.
Research involving Human Participants and/or Animals
This study did not involve human participants or animals.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sun, Y., Wang, M. & Ma, Y. Semantic-alignment transformer and adversary hashing for cross-modal retrieval. Appl Intell 54, 7581–7602 (2024). https://doi.org/10.1007/s10489-024-05501-2