Abstract
In recent years, driven by the rapid growth of cross-modal data such as images and texts, cross-modal retrieval has received intensive attention. Great progress has been made in deep cross-modal hash retrieval, which integrates feature learning and hash learning into an end-to-end trainable framework to obtain better hash codes. However, due to the heterogeneity between images and texts, comparing their similarity remains a challenge. Most previous approaches embed images and texts into a joint embedding subspace independently and then compare their similarity, which ignores both the influence of irrelevant regions (regions in images without a corresponding textual description) on cross-modal retrieval and the fine-grained interactions between images and texts. To address these issues, a new cross-modal hashing method called Deep Translated Attention Hashing for Cross-Modal Retrieval (DTAH) is proposed. First, DTAH extracts image and text features through bottom-up attention and a recurrent neural network, respectively, to reduce the influence of irrelevant regions on cross-modal retrieval. Then, with the help of a cross-modal attention module, DTAH captures the fine-grained interactions between vision and language at the region and word levels, and embeds the text features into the image feature space. In this way, the proposed DTAH effectively reduces the heterogeneity between images and texts and learns discriminative hash codes. Extensive experiments on three benchmark datasets demonstrate that DTAH surpasses state-of-the-art methods.
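To make the described pipeline concrete, the following is a minimal PyTorch sketch of the kind of cross-modal attention step the abstract outlines: word features are projected into the image region feature space, each region attends over the words, and the fused representation is mapped to relaxed hash codes. All dimensions, layer choices, and the fusion scheme are illustrative assumptions for exposition, not the authors' exact DTAH architecture.

```python
import torch
import torch.nn as nn

class CrossModalAttentionHash(nn.Module):
    """Illustrative sketch (not the authors' exact design): attend word
    features over image region features, translate text into the image
    feature space, and emit relaxed hash codes."""

    def __init__(self, region_dim=2048, word_dim=1024, hash_bits=64):
        super().__init__()
        # Project words into the region feature space so dot-product
        # attention between regions and words is well defined.
        self.word_proj = nn.Linear(word_dim, region_dim)
        self.scale = region_dim ** -0.5
        # Hash layer: tanh gives relaxed codes in (-1, 1); sign() binarizes.
        self.hash_layer = nn.Linear(region_dim, hash_bits)

    def forward(self, regions, words):
        # regions: (B, R, region_dim) from bottom-up attention (e.g. Faster R-CNN)
        # words:   (B, W, word_dim)   from a recurrent text encoder (e.g. GRU)
        w = self.word_proj(words)                                 # (B, W, region_dim)
        attn = torch.softmax(
            regions @ w.transpose(1, 2) * self.scale, dim=2)      # (B, R, W)
        # Each region gathers the words that describe it; regions without
        # matching words receive only a diffuse, low-impact context vector.
        text_in_image_space = attn @ w                            # (B, R, region_dim)
        fused = (regions + text_in_image_space).mean(dim=1)       # (B, region_dim)
        return torch.tanh(self.hash_layer(fused))                 # relaxed hash codes

# Usage with dummy inputs: 36 detected regions per image, 20 word states per caption.
regions = torch.randn(2, 36, 2048)
words = torch.randn(2, 20, 1024)
codes = CrossModalAttentionHash()(regions, words)
print(codes.shape)  # torch.Size([2, 64])
```

At retrieval time, such relaxed codes would be binarized with sign() and compared by Hamming distance, the standard practice in hashing-based retrieval.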
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant Nos. 62020106011 and 61828105, and by the Chen Guang Project supported by the Shanghai Municipal Education Commission and the Shanghai Education Development Foundation under Grant No. 17CG41.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Cite this article
Yu, H., Ma, R., Su, M. et al. A novel deep translated attention hashing for cross-modal retrieval. Multimed Tools Appl 81, 26443–26461 (2022). https://doi.org/10.1007/s11042-022-12860-w