
Semantic Structure Enhanced Contrastive Adversarial Hash Network for Cross-media Representation Learning

Published: 10 October 2022

Abstract

Deep cross-media hashing provides an efficient cross-media representation learning solution for cross-media search. However, existing methods do not jointly exploit fine-grained semantic features and semantic structures to mine implicit cross-media semantic associations, which weakens the semantic discrimination and consistency of the learned cross-media representations. To tackle this problem, we propose a novel semantic structure enhanced contrastive adversarial hash network for cross-media representation learning (SCAHN). First, to capture more fine-grained cross-media semantic associations, a fine-grained cross-media attention feature learning network is constructed, so that the learned salient features of the different modalities are more conducive to cross-media semantic alignment and fusion. Second, to further improve the learning of implicit cross-media semantic associations, a semantic label association graph is constructed and a graph convolutional network is used to mine the implicit semantic structures, thereby guiding the learning of discriminative features for each modality. Third, a cross-media and intra-media contrastive adversarial representation learning mechanism is proposed to further enhance the semantic discriminativeness of the modality-specific representations, and a dual-way adversarial learning strategy is developed to maximize cross-media semantic associations, yielding cross-media unified representations with stronger discriminative power and better semantic consistency. Extensive experiments on several cross-media benchmark datasets demonstrate that the proposed SCAHN outperforms state-of-the-art methods.
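
To make the label-graph component described above more concrete, the following is a minimal sketch (not the authors' code; the module names, layer sizes, GloVe-initialized label embeddings, and the toy data are illustrative assumptions): a two-layer graph convolutional network over a normalized label co-occurrence matrix produces semantic-structure-aware class embeddings, which can then score modality features and inject the label structure into feature learning.

```python
# Illustrative sketch of a semantic-label-graph branch (assumptions, not the
# paper's implementation): a two-layer GCN over a label co-occurrence graph
# maps word vectors of the class labels to class embeddings that guide the
# image/text feature learners.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LabelGCN(nn.Module):
    def __init__(self, in_dim=300, hid_dim=512, out_dim=1024):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim, bias=False)
        self.w2 = nn.Linear(hid_dim, out_dim, bias=False)

    def forward(self, label_emb, adj):
        # label_emb: (C, in_dim) word vectors of the C class labels (e.g. GloVe)
        # adj:       (C, C) normalized label co-occurrence adjacency matrix
        h = F.leaky_relu(adj @ self.w1(label_emb))
        return adj @ self.w2(h)                      # (C, out_dim) class embeddings


def normalize_adj(co_occurrence):
    # Symmetric normalization D^{-1/2} (A + I) D^{-1/2}, as in Kipf & Welling.
    a = co_occurrence + torch.eye(co_occurrence.size(0))
    d = a.sum(1).pow(-0.5)
    return d.unsqueeze(1) * a * d.unsqueeze(0)


# The class embeddings act as label-aware projections: scoring modality
# features against them yields label predictions that carry the graph structure.
gcn = LabelGCN()
adj = normalize_adj(torch.rand(24, 24))              # toy 24-label graph
class_emb = gcn(torch.randn(24, 300), adj)           # (24, 1024)
img_feat = torch.randn(8, 1024)                      # batch of image features
label_logits = img_feat @ class_emb.t()              # (8, 24) label scores
```

In the paper's setting, such structure-aware class embeddings would guide both the image and text branches toward label-consistent, discriminative features; the exact way the GCN output is combined with the attention features is described in the full paper.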

Supplementary Material

MP4 File (MM22-fp2969.mp4)
Nowadays, there are massive amounts of cross-media data on the Internet, and the descriptions carried by different media types are complementary, so cross-media search can surface more information across modalities than single-modality search. However, existing methods are weak in semantic discrimination and consistency for cross-media representation. To tackle this problem, we propose a novel semantic structure enhanced contrastive adversarial hash network for cross-media representation learning (SCAHN), which combines fine-grained attention-based semantic features with semantic structures to mine implicit cross-media semantic associations. In particular, we propose a cross-media and intra-media contrastive adversarial representation learning mechanism to further enhance the semantic discriminativeness of the modality-specific representations, and develop a dual-way adversarial learning strategy to maximize cross-media semantic associations. Extensive experiments on several benchmark datasets demonstrate that SCAHN outperforms state-of-the-art methods.
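
As a rough illustration of the contrastive adversarial mechanism mentioned above, the sketch below pairs a symmetric cross-media InfoNCE-style loss with a modality discriminator trained in a dual-way (min-max) fashion; the temperature, the discriminator architecture, and the loss weighting are assumptions rather than the paper's reported configuration.

```python
# Hedged sketch of a cross-media contrastive loss plus a dual-way adversarial
# alignment step; all hyper-parameters and module shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


def cross_media_contrastive(img_code, txt_code, tau=0.2):
    # Paired image/text features in a batch are positives; all other pairs
    # are negatives. Symmetric image->text and text->image directions.
    img = F.normalize(img_code, dim=1)
    txt = F.normalize(txt_code, dim=1)
    logits = img @ txt.t() / tau                     # (B, B) similarity matrix
    targets = torch.arange(img.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


class ModalityDiscriminator(nn.Module):
    # Predicts whether a representation came from the image or the text branch.
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 1))

    def forward(self, x):
        return self.net(x).squeeze(1)


B, bits = 8, 64
img_code, txt_code = torch.randn(B, bits), torch.randn(B, bits)
disc = ModalityDiscriminator(bits)
real, fake = torch.ones(B), torch.zeros(B)

# Discriminator step: learn to tell image codes (label 1) from text codes (label 0).
d_loss = (F.binary_cross_entropy_with_logits(disc(img_code.detach()), real)
          + F.binary_cross_entropy_with_logits(disc(txt_code.detach()), fake))

# Encoder step (dual-way): align the two modalities by confusing the
# discriminator in both directions, together with the contrastive term.
g_loss = (cross_media_contrastive(img_code, txt_code)
          + F.binary_cross_entropy_with_logits(disc(img_code), fake)
          + F.binary_cross_entropy_with_logits(disc(txt_code), real))
```

The two losses would be optimized alternately, with d_loss updating the discriminator and g_loss updating the image and text encoders toward a common, semantically consistent representation space.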





    Published In

    MM '22: Proceedings of the 30th ACM International Conference on Multimedia
    October 2022
    7537 pages
    ISBN: 9781450392037
    DOI: 10.1145/3503161


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 10 October 2022


    Author Tags

    1. contrastive adversarial hash network
    2. cross-media and intra-media contrastive learning
    3. cross-media representation learning
    4. cross-media search
    5. graph convolutional network

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • CAAI-Huawei MindSpore Open Fund

    Conference

    MM '22

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

