DOI: 10.1145/3460426.3463632
Research article

Know Yourself and Know Others: Efficient Common Representation Learning for Few-shot Cross-modal Retrieval

Published: 01 September 2021

Abstract

Learning common representations for data from different modalities is the key component of cross-modal retrieval. Most existing deep approaches learn multiple networks that independently project each sample into the common space, so each representation is extracted from the corresponding sample alone and its relationships to other samples are ignored. Learning effective common representations is therefore challenging when supervised multi-modal training data are scarce, as in few-shot cross-modal retrieval, yet how to efficiently exploit the information contained in other examples remains underexplored. In this work, we present the Self-Others Net, a few-shot cross-modal retrieval model that exploits the information contained both in a sample itself and in the other samples. First, we propose a self-network that fully exploits the correlations hidden within the sample itself: it integrates the features from different layers to extract multi-level information. Second, we propose an others-network that models the relationships among all samples: it learns a Mahalanobis tensor and mixes the prototypes of all data to capture the non-linear dependencies needed for common representation learning. Extensive experiments on three benchmark datasets demonstrate clear improvements of the proposed method over state-of-the-art methods.
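
The two components described above can be pictured with a short sketch. The following PyTorch code is a minimal illustration based only on this abstract: the module names (SelfNetwork, OthersNetwork), the feature dimensions, the summation-based fusion, and the softmax mixing rule are all assumptions made for clarity, not the paper's actual architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SelfNetwork(nn.Module):
        # Hypothetical "self-network": projects one sample into the common
        # space by fusing features taken from several backbone layers,
        # rather than using the final layer alone.
        def __init__(self, layer_dims=(512, 1024, 2048), common_dim=256):
            super().__init__()
            # One projection head per backbone layer.
            self.heads = nn.ModuleList([nn.Linear(d, common_dim) for d in layer_dims])

        def forward(self, layer_feats):
            # layer_feats: list of (batch, d_i) tensors from different layers.
            fused = sum(head(feat) for head, feat in zip(self.heads, layer_feats))
            return F.normalize(fused, dim=-1)

    class OthersNetwork(nn.Module):
        # Hypothetical "others-network": relates each query to the class
        # prototypes through a learned bilinear (Mahalanobis-style) form
        # and mixes the prototypes by those relation weights.
        def __init__(self, common_dim=256):
            super().__init__()
            # Learnable metric tensor M; x^T M p generalizes the dot product.
            self.M = nn.Parameter(torch.eye(common_dim))

        def forward(self, queries, prototypes):
            # queries: (q, d); prototypes: (c, d), e.g. support-class means.
            scores = queries @ self.M @ prototypes.t()   # (q, c) relations
            weights = scores.softmax(dim=-1)
            mixed = weights @ prototypes                 # (q, d) prototype mix
            # Each representation now reflects other samples, not just itself.
            return F.normalize(queries + mixed, dim=-1), scores

In a few-shot episode, the image and text branches would each run a SelfNetwork over their backbone features, and the scores returned by OthersNetwork could serve as retrieval logits across modalities.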

Published In

ICMR '21: Proceedings of the 2021 International Conference on Multimedia Retrieval
August 2021, 715 pages
ISBN: 9781450384636
DOI: 10.1145/3460426

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. cross-modal retrieval
2. few-shot learning
3. representation learning

Conference

ICMR '21
Overall Acceptance Rate: 254 of 830 submissions, 31%
