DOI: 10.1145/3460426.3463632
Research article

Know Yourself and Know Others: Efficient Common Representation Learning for Few-shot Cross-modal Retrieval

Published: 01 September 2021

Abstract

Learning common representations for data from different modalities is the key component of cross-modal retrieval. Most existing deep approaches learn multiple networks that independently project each sample into the common space, so each representation is extracted from the corresponding sample alone and its relationships to other samples are ignored. Learning effective common representations is therefore challenging when supervised multi-modal training data are scarce, as in few-shot cross-modal retrieval, yet how to efficiently exploit the information contained in other examples remains underexplored. In this work, we present the Self-Others Net, a few-shot cross-modal retrieval model that exploits the information contained both in a sample itself and in the other samples. First, we propose a self-network that fully exploits the correlations hidden within the sample itself: it integrates the features from different layers to extract multi-level information. Second, we propose an others-network that models the relationships among all samples: it learns a Mahalanobis tensor and mixes the prototypes of all data to capture the non-linear dependencies needed for common representation learning. Extensive experiments on three benchmark datasets demonstrate clear improvements of the proposed method over state-of-the-art methods.
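
The two components described above can be pictured with a short sketch. The following PyTorch code is a minimal illustration based only on this abstract: the module names (SelfNetwork, OthersNetwork), the feature dimensions, the summation-based fusion, and the softmax mixing rule are all assumptions made for clarity, not the paper's actual architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SelfNetwork(nn.Module):
        # Hypothetical "self-network": projects one sample into the common
        # space by fusing features taken from several backbone layers,
        # rather than using the final layer alone.
        def __init__(self, layer_dims=(512, 1024, 2048), common_dim=256):
            super().__init__()
            # One projection head per backbone layer.
            self.heads = nn.ModuleList([nn.Linear(d, common_dim) for d in layer_dims])

        def forward(self, layer_feats):
            # layer_feats: list of (batch, d_i) tensors from different layers.
            fused = sum(head(feat) for head, feat in zip(self.heads, layer_feats))
            return F.normalize(fused, dim=-1)

    class OthersNetwork(nn.Module):
        # Hypothetical "others-network": relates each query to the class
        # prototypes through a learned bilinear (Mahalanobis-style) form
        # and mixes the prototypes by those relation weights.
        def __init__(self, common_dim=256):
            super().__init__()
            # Learnable metric tensor M; x^T M p generalizes the dot product.
            self.M = nn.Parameter(torch.eye(common_dim))

        def forward(self, queries, prototypes):
            # queries: (q, d); prototypes: (c, d), e.g. support-class means.
            scores = queries @ self.M @ prototypes.t()   # (q, c) relations
            weights = scores.softmax(dim=-1)
            mixed = weights @ prototypes                 # (q, d) prototype mix
            # Each representation now reflects other samples, not just itself.
            return F.normalize(queries + mixed, dim=-1), scores

In a few-shot episode, the image and text branches would each run a SelfNetwork over their backbone features, and the scores returned by OthersNetwork could serve as retrieval logits across modalities.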

Published In

ICMR '21: Proceedings of the 2021 International Conference on Multimedia Retrieval
August 2021, 715 pages
ISBN: 9781450384636
DOI: 10.1145/3460426

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. cross-modal retrieval
2. few-shot learning
3. representation learning

Conference

ICMR '21
Overall Acceptance Rate: 254 of 830 submissions, 31%
