DOI: 10.1145/3123266.3123326

Adversarial Cross-Modal Retrieval

Published: 19 October 2017

Abstract

Cross-modal retrieval aims to enable flexible retrieval across different modalities (e.g., texts vs. images). The core of cross-modal retrieval research is to learn a common subspace in which items of different modalities can be directly compared to each other. In this paper, we present a novel Adversarial Cross-Modal Retrieval (ACMR) method, which seeks an effective common subspace based on adversarial learning. Adversarial learning is implemented as an interplay between two processes. The first process, a feature projector, tries to generate modality-invariant representations in the common subspace and to confuse the other process, a modality classifier, which tries to discriminate between modalities based on the generated representations. We further impose triplet constraints on the feature projector to minimize the gap among representations of items from different modalities that share the same semantic labels, while maximizing the distances among semantically different images and texts. By jointly exploiting these two mechanisms, the underlying cross-modal semantic structure of multimedia data is better preserved when the data is projected into the common subspace. Comprehensive experimental results on four widely used benchmark datasets show that the proposed ACMR method is superior in learning effective subspace representations and that it significantly outperforms state-of-the-art cross-modal retrieval methods.
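
To make the interplay described above concrete, the following is a minimal PyTorch sketch of an adversarially trained common subspace; it is an illustration only, not the authors' implementation. The layer sizes, the 4096-d image / 300-d text input dimensions, the gradient-reversal trick, and the rolled-batch negative sampling are all assumptions introduced for this example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    # Identity on the forward pass; negates gradients on the backward pass,
    # so the projectors learn to confuse the modality classifier.
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

class ACMRSketch(nn.Module):
    def __init__(self, img_dim=4096, txt_dim=300, common_dim=200):
        super().__init__()
        # Feature projectors: map each modality into the common subspace.
        self.img_proj = nn.Sequential(nn.Linear(img_dim, common_dim), nn.Tanh())
        self.txt_proj = nn.Sequential(nn.Linear(txt_dim, common_dim), nn.Tanh())
        # Modality classifier: tries to tell images (0) from texts (1).
        self.modality_clf = nn.Linear(common_dim, 2)

    def forward(self, img_feat, txt_feat):
        v = self.img_proj(img_feat)   # image embeddings in the common subspace
        t = self.txt_proj(txt_feat)   # text embeddings in the common subspace
        # Adversarial branch: gradients are reversed before reaching the projectors.
        mod_logits = self.modality_clf(GradReverse.apply(torch.cat([v, t], 0)))
        return v, t, mod_logits

def training_step(model, img_feat, txt_feat, optimizer, margin=1.0):
    # One joint update: modality-adversarial loss plus a cross-modal triplet loss.
    # Assumes paired batches where row i of each tensor shares a semantic label.
    v, t, mod_logits = model(img_feat, txt_feat)
    n = img_feat.size(0)
    mod_labels = torch.cat([torch.zeros(n, dtype=torch.long),
                            torch.ones(n, dtype=torch.long)])
    adv_loss = F.cross_entropy(mod_logits, mod_labels)
    # Triplet constraint: pull each image toward its paired text, push it away
    # from a mismatched text (here simply the next row, as a toy negative).
    neg_t = t.roll(1, dims=0)
    trip_loss = F.triplet_margin_loss(v, t, neg_t, margin=margin)
    loss = trip_loss + adv_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random features standing in for CNN/word-vector inputs.
model = ACMRSketch()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss = training_step(model, torch.randn(8, 4096), torch.randn(8, 300), opt)

The paper frames the two processes as a minimax game trained in alternation; the gradient-reversal layer used in this sketch is a common single-pass approximation of that alternating scheme, in the spirit of the domain-adaptation work the paper builds on.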




Published In

MM '17: Proceedings of the 25th ACM international conference on Multimedia
October 2017
2028 pages
ISBN: 9781450349062
DOI: 10.1145/3123266

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. adversarial learning
  2. cross-modal retrieval
  3. modality gap

Qualifiers

  • Research-article

Conference

MM '17: ACM Multimedia Conference
October 23-27, 2017
Mountain View, California, USA

Acceptance Rates

MM '17 paper acceptance rate: 189 of 684 submissions, 28%.
Overall acceptance rate: 2,145 of 8,556 submissions, 25%.


Cited By

  • (2025) Select & Re-Rank: Effectively and efficiently matching multimodal data with dynamically evolving attention. Neurocomputing 618, 129003. DOI: 10.1016/j.neucom.2024.129003. Online publication date: Feb-2025.
  • (2025) Modality-specific adaptive scaling and attention network for cross-modal retrieval. Neurocomputing 612, 128664. DOI: 10.1016/j.neucom.2024.128664. Online publication date: Jan-2025.
  • (2024) Supervised Contrastive Learning for 3D Cross-Modal Retrieval. Applied Sciences 14(22), 10322. DOI: 10.3390/app142210322. Online publication date: 10-Nov-2024.
  • (2024) Soft Contrastive Cross-Modal Retrieval. Applied Sciences 14(5), 1944. DOI: 10.3390/app14051944. Online publication date: 27-Feb-2024.
  • (2024) Gut microbiome-metabolome interactions predict host condition. Microbiome 12(1). DOI: 10.1186/s40168-023-01737-1. Online publication date: 10-Feb-2024.
  • (2024) A method for image-text matching based on semantic filtering and adaptive adjustment. Journal on Image and Video Processing 2024(1). DOI: 10.1186/s13640-024-00639-y. Online publication date: 29-Aug-2024.
  • (2024) Cross-modal Retrieval Based on Multi-modal Large Model With Convolutional Attention and Adversarial Training. Proceedings of the First International Workshop on IoT Datasets for Multi-modal Large Model, 50-56. DOI: 10.1145/3698385.3699877. Online publication date: 4-Nov-2024.
  • (2024) Partially Aligned Cross-modal Retrieval via Optimal Transport-based Prototype Alignment Learning. Proceedings of the 32nd ACM International Conference on Multimedia, 701-709. DOI: 10.1145/3664647.3681577. Online publication date: 28-Oct-2024.
  • (2024) Anchor-aware Deep Metric Learning for Audio-visual Retrieval. Proceedings of the 2024 International Conference on Multimedia Retrieval, 211-219. DOI: 10.1145/3652583.3658067. Online publication date: 30-May-2024.
  • (2024) Cross-Modal Attention Preservation with Self-Contrastive Learning for Composed Query-Based Image Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications 20(6), 1-22. DOI: 10.1145/3639469. Online publication date: 8-Mar-2024.
