DOI: 10.1145/3123266.3123326

Adversarial Cross-Modal Retrieval

Published: 19 October 2017

Abstract

Cross-modal retrieval aims to enable flexible retrieval across different modalities (e.g., texts vs. images). The core of cross-modal retrieval research is to learn a common subspace in which items of different modalities can be directly compared to each other. In this paper, we present a novel Adversarial Cross-Modal Retrieval (ACMR) method, which seeks an effective common subspace based on adversarial learning. Adversarial learning is implemented as an interplay between two processes. The first process, a feature projector, tries to generate modality-invariant representations in the common subspace and to confuse the other process, a modality classifier, which tries to discriminate between modalities based on the generated representations. We further impose triplet constraints on the feature projector to minimize the gap among representations of items from different modalities that share the same semantic labels, while maximizing the distances among semantically different images and texts. By jointly exploiting these two mechanisms, the underlying cross-modal semantic structure of multimedia data is better preserved when the data is projected into the common subspace. Comprehensive experimental results on four widely used benchmark datasets show that the proposed ACMR method is superior in learning effective subspace representations and that it significantly outperforms state-of-the-art cross-modal retrieval methods.
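
To make the interplay described above concrete, the following is a minimal PyTorch sketch of an adversarially trained common subspace; it is an illustration only, not the authors' implementation. The layer sizes, the 4096-d image / 300-d text input dimensions, the gradient-reversal trick, and the rolled-batch negative sampling are all assumptions introduced for this example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    # Identity on the forward pass; negates gradients on the backward pass,
    # so the projectors learn to confuse the modality classifier.
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

class ACMRSketch(nn.Module):
    def __init__(self, img_dim=4096, txt_dim=300, common_dim=200):
        super().__init__()
        # Feature projectors: map each modality into the common subspace.
        self.img_proj = nn.Sequential(nn.Linear(img_dim, common_dim), nn.Tanh())
        self.txt_proj = nn.Sequential(nn.Linear(txt_dim, common_dim), nn.Tanh())
        # Modality classifier: tries to tell images (0) from texts (1).
        self.modality_clf = nn.Linear(common_dim, 2)

    def forward(self, img_feat, txt_feat):
        v = self.img_proj(img_feat)   # image embeddings in the common subspace
        t = self.txt_proj(txt_feat)   # text embeddings in the common subspace
        # Adversarial branch: gradients are reversed before reaching the projectors.
        mod_logits = self.modality_clf(GradReverse.apply(torch.cat([v, t], 0)))
        return v, t, mod_logits

def training_step(model, img_feat, txt_feat, optimizer, margin=1.0):
    # One joint update: modality-adversarial loss plus a cross-modal triplet loss.
    # Assumes paired batches where row i of each tensor shares a semantic label.
    v, t, mod_logits = model(img_feat, txt_feat)
    n = img_feat.size(0)
    mod_labels = torch.cat([torch.zeros(n, dtype=torch.long),
                            torch.ones(n, dtype=torch.long)])
    adv_loss = F.cross_entropy(mod_logits, mod_labels)
    # Triplet constraint: pull each image toward its paired text, push it away
    # from a mismatched text (here simply the next row, as a toy negative).
    neg_t = t.roll(1, dims=0)
    trip_loss = F.triplet_margin_loss(v, t, neg_t, margin=margin)
    loss = trip_loss + adv_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random features standing in for CNN/word-vector inputs.
model = ACMRSketch()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss = training_step(model, torch.randn(8, 4096), torch.randn(8, 300), opt)

The paper frames the two processes as a minimax game trained in alternation; the gradient-reversal layer used in this sketch is a common single-pass approximation of that alternating scheme, in the spirit of the domain-adaptation work the paper builds on.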




Published In

MM '17: Proceedings of the 25th ACM international conference on Multimedia
October 2017
2028 pages
ISBN: 9781450349062
DOI: 10.1145/3123266

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. adversarial learning
  2. cross-modal retrieval
  3. modality gap

Qualifiers

  • Research-article

Conference

MM '17: ACM Multimedia Conference
October 23-27, 2017
Mountain View, California, USA

Acceptance Rates

MM '17 paper acceptance rate: 189 of 684 submissions, 28%.
Overall acceptance rate: 2,145 of 8,556 submissions, 25%.


Cited By

  • (2025) Select & Re-Rank: Effectively and efficiently matching multimodal data with dynamically evolving attention. Neurocomputing 618, 129003. DOI: 10.1016/j.neucom.2024.129003. Online publication date: Feb-2025.
  • (2025) Modality-specific adaptive scaling and attention network for cross-modal retrieval. Neurocomputing 612, 128664. DOI: 10.1016/j.neucom.2024.128664. Online publication date: Jan-2025.
  • (2024) Supervised Contrastive Learning for 3D Cross-Modal Retrieval. Applied Sciences 14(22), 10322. DOI: 10.3390/app142210322. Online publication date: 10-Nov-2024.
  • (2024) Soft Contrastive Cross-Modal Retrieval. Applied Sciences 14(5), 1944. DOI: 10.3390/app14051944. Online publication date: 27-Feb-2024.
  • (2024) Gut microbiome-metabolome interactions predict host condition. Microbiome 12(1). DOI: 10.1186/s40168-023-01737-1. Online publication date: 10-Feb-2024.
  • (2024) A method for image-text matching based on semantic filtering and adaptive adjustment. Journal on Image and Video Processing 2024(1). DOI: 10.1186/s13640-024-00639-y. Online publication date: 29-Aug-2024.
  • (2024) Cross-modal Retrieval Based on Multi-modal Large Model With Convolutional Attention and Adversarial Training. Proceedings of the First International Workshop on IoT Datasets for Multi-modal Large Model, 50-56. DOI: 10.1145/3698385.3699877. Online publication date: 4-Nov-2024.
  • (2024) Partially Aligned Cross-modal Retrieval via Optimal Transport-based Prototype Alignment Learning. Proceedings of the 32nd ACM International Conference on Multimedia, 701-709. DOI: 10.1145/3664647.3681577. Online publication date: 28-Oct-2024.
  • (2024) Anchor-aware Deep Metric Learning for Audio-visual Retrieval. Proceedings of the 2024 International Conference on Multimedia Retrieval, 211-219. DOI: 10.1145/3652583.3658067. Online publication date: 30-May-2024.
  • (2024) Cross-Modal Attention Preservation with Self-Contrastive Learning for Composed Query-Based Image Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications 20(6), 1-22. DOI: 10.1145/3639469. Online publication date: 8-Mar-2024.
