[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ Skip to main content

Advertisement

Log in

A comprehensive survey on pretrained foundation models: a history from BERT to ChatGPT

  • Original Article
  • Published:
International Journal of Machine Learning and Cybernetics Aims and scope Submit manuscript

Abstract

Pretrained Foundation Models (PFMs) are regarded as the foundation for various downstream tasks across different data modalities. A PFM (e.g., BERT, ChatGPT, GPT-4) is trained on large-scale data, providing a solid parameter initialization for a wide range of downstream applications. In contrast to earlier methods that use convolution and recurrent modules for feature extraction, BERT learns bidirectional encoder representations from Transformers, trained on large datasets as contextual language models. Similarly, the Generative Pretrained Transformer (GPT) method employs Transformers as feature extractors and is trained on large datasets using an autoregressive paradigm. Recently, ChatGPT has demonstrated significant success in large language models, utilizing autoregressive language models with zero-shot or few-shot prompting. The remarkable success of PFMs has driven significant breakthroughs in AI, leading to numerous studies proposing various methods, datasets, and evaluation metrics, which increases the demand for an updated survey. This study provides a comprehensive review of recent research advancements, challenges, and opportunities for PFMs in text, image, graph, and other data modalities. It covers the basic components and existing pretraining methods used in natural language processing, computer vision, and graph learning, while also exploring advanced PFMs for different data modalities and unified PFMs that address data quality and quantity. Additionally, the review discusses key aspects such as model efficiency, security, and privacy, and provides insights into future research directions and challenges in PFMs. Overall, this survey aims to shed light on the research of the PFMs on scalability, security, logical reasoning ability, cross-domain learning ability, and user-friendly interactive ability for artificial general intelligence.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
£29.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price includes VAT (United Kingdom)

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14

Similar content being viewed by others

Explore related subjects

Discover the latest articles, news and stories from top researchers in related subjects.

Data availability

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Notes

  1. https://openai.com/blog/chatgpt/.

  2. March 2023.

  3. https://blog.google/technology/ai/bard-google-ai-search-updates/.

References

  1. Bommasani R, Hudson DA, Adeli E, Altman R, Arora S, Arx S, Bernstein MS, Bohg J, Bosselut A, Brunskill E et al (2021) On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258

  2. Chowdhury GG (2003) Natural language processing. Annu Rev Inf Sci Technol 37:51–89

    Article  Google Scholar 

  3. Forsyth D, Ponce J (2011) Computer vision: a modern approach. University of Illinois at Urbana-Champaign, USA

    Google Scholar 

  4. Bondy JA, Murty USR et al (1976) Graph theory with applications. Macmillan, London

    Book  Google Scholar 

  5. Qiu X, Sun T, Xu Y, Shao Y, Dai N, Huang X (2020) Pre-trained models for natural language processing: a survey. Sci China Technol Sci 63(10):1872–1897

  6. Li J, Tang T, Zhao WX, Wen J-R (2021) Pretrained language models for text generation: a survey

  7. Han K, Wang Y, Chen H, Chen X, Guo J, Liu Z, Tang Y, Xiao A, Xu C, Xu Y et al (2020) A survey on visual transformer. arXiv

  8. Sanchez S, Romero H, Morales A (2020) A review: comparison of performance metrics of pretrained models for object detection using the tensorflow framework. In: IOP conference series: materials science and engineering

  9. Hu W, Liu B, Gomes J, Zitnik M, Liang P, Pande V, Leskovec J (2019) Pre-training graph neural networks. arXiv

  10. Zhuang F, Qi Z, Duan K, Xi D, Zhu Y, Zhu H, Xiong H, He Q (2020) A comprehensive survey on transfer learning. In: Proceedings of the IEEE

  11. Bengio Y, Ducharme R, Vincent P, Janvin C (2000) A neural probabilistic language model. Adv Neural Inf Proc Syst

  12. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. In: 1st international conference on learning representations, ICLR 2013. Workshop track proceedings, Scottsdale, Arizona, USA, May 2–4, 2013

  13. Devlin J, Chang M, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT

  14. Yang Z, Dai Z, Yang Y, Carbonell JG, Salakhutdinov R, Le QV (2019) Xlnet: generalized autoregressive pretraining for language understanding. In: NeurIPS

  15. Chen M, Tworek J, Jun H, Yuan Q, Pinto HPdO, Kaplan J, Edwards H, Burda Y, Joseph N, Brockman G et al (2021) Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374

  16. Neelakantan A, Xu T, Puri R, Radford A, Han JM, Tworek J, Yuan Q, Tezak N, Kim JW, Hallacy C et al (2022) Text and code embeddings by contrastive pre-training. arXiv preprint arXiv:2201.10005

  17. Christiano PF, Leike J, Brown T, Martic M, Legg S, Amodei D (2017) Deep reinforcement learning from human preferences. Adv Neural Inf Process Syst

  18. Stiennon N, Ouyang L, Wu J, Ziegler D, Lowe R, Voss C, Radford A, Amodei D, Christiano PF (2020) Learning to summarize with human feedback. Adv Neural Inf Process Syst 33:3008–3021

    Google Scholar 

  19. Ouyang L, Wu J, Jiang X, Almeida D, Wainwright CL, Mishkin P, Zhang C, Agarwal S, Slama K, Ray A., et al. (2022) Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155

  20. Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A et al (2020) Language models are few-shot learners. arXiv

  21. Lester B, Al-Rfou R, Constant N (2021) The power of scale for parameter-efficient prompt tuning. In: Proceedings of the 2021 conference on empirical methods in natural language processing, pp 3045–3059

  22. Schick T, Schütze H (2021) Exploiting cloze-questions for few-shot text classification and natural language inference. In: Proceedings of the 16th conference of the European chapter of the association for computational linguistics: main volume, pp 255–269

  23. Zhang Z, Zhang A, Li M, Smola A (2023) Automatic chain of thought prompting in large language models. In: International conference on learning representations

  24. Wei J, Wang X, Schuurmans D, Bosma M, Xia F, Chi EH, Le QV, Zhou D et al (2022) Chain-of-thought prompting elicits reasoning in large language models. Adv Neu Inform Process Syst 35:24824–24837

  25. OpenAI (2023) GPT-4 technical report

  26. Wang P, Yang A, Men R, Lin J, Bai S, Li Z, Ma J, Zhou C, Zhou J, Yang H (2022) Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. arXiv preprint arXiv:2202.03052

  27. Lu J, Clark C, Zellers R, Mottaghi R, Kembhavi A (2022) Unified-io: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916

  28. Singh A, Hu R, Goswami V, Couairon G, Galuba W, Rohrbach M, Kiela D (2022) Flava: a foundational language and vision alignment model. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 15638–15650

  29. Wang W, Bao H, Dong L, Bjorck J, Peng Z, Liu Q, Aggarwal K, Mohammed OK, Singhal S, Som S et al (2022) Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442

  30. Clark K, Luong M, Le QV, Manning CD (2020) ELECTRA: pre-training text encoders as discriminators rather than generators. In: ICLR

  31. Wallace E, Rodriguez P, Feng S, Yamada I, Boyd-Graber J (2019) Trick me if you can: human-in-the-loop generation of adversarial examples for question answering. Trans Assoc Comput Linguist 7:387–401

  32. Nie Y, Williams A, Dinan E, Bansal M, Weston J, Kiela D (2020) Adversarial NLI: a new benchmark for natural language understanding. In: ACL

  33. Niven T, Kao H (2019) Probing neural network comprehension of natural language arguments. In: ACL

  34. Wang G, Ivanov N, Chen B, Wang Q, Yan Q (2023) Graph learning for interactive threat detection in heterogeneous smart home rule data. In: 2023 ACM SIGMOD international conference on management of data. ACM

  35. Gordon MA, Duh K, Andrews N (2020) Compressing BERT: studying the effects of weight pruning on transfer learning. In: RepL4NLP@ACL . https://doi.org/10.18653/v1/2020.repl4nlp-1.18

  36. Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R (2020) ALBERT: a lite BERT for self-supervised learning of language representations. In: ICLR

  37. Han X, Zhang Z, Ding N, Gu Y, Liu X, Huo Y, Qiu J, Zhang L, Han W, Huang M et al (2021) Pre-trained models: past, present and future. AI Open 2:225–250

  38. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. arXiv

  39. Guo M-H, Xu T-X, Liu J-J, Liu Z-N, Jiang P-T, Mu T-J, Zhang S-H, Martin RR, Cheng M-M, Hu S-M (2022) Attention mechanisms in computer vision: a survey. Comput Visual Media 8(3):331–368

    Article  Google Scholar 

  40. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929

  41. Yun S, Jeong M, Kim R, Kang J, Kim HJ (2019) Graph transformer networks. Adv Neural Inf Process Syst 32

  42. Dehghani M, Djolonga J, Mustafa B, Padlewski P, Heek J, Gilmer J, Steiner AP, Caron M, Geirhos R, Alabdulmohsin I et al (2023) Scaling vision transformers to 22 billion parameters. In: International conference on machine learning. PMLR, pp 7480–7512

  43. Chowdhery A, Narang S, Devlin J, Bosma M, Mishra G, Roberts A, Barham P, Chung HW, Sutton C, Gehrmann S et al (2022) Palm: scaling language modeling with pathways. arXiv preprint arXiv:2204.02311

  44. Joshi M, Chen D, Liu Y, Weld DS, Zettlemoyer L, Levy O (2020) Spanbert: improving pre-training by representing and predicting spans. Trans Assoc Comput Linguist 8:64–77

  45. Lewis M, Liu Y, Goyal N, Ghazvininejad M, Mohamed A, Levy O, Stoyanov V, Zettlemoyer L (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: ACL. https://doi.org/10.18653/v1/2020.acl-main.703

  46. Clark K, Luong M-T, Le QV, Manning CD (2020) Electra: pre-training text encoders as discriminators rather than generators. arXiv

  47. Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R (2019) Albert: a lite bert for self-supervised learning of language representations. arXiv

  48. Donahue J, Krähenbühl P, Darrell T (2016) Adversarial feature learning. arXiv

  49. Caron M, Bojanowski P, Joulin A, Douze M (2018) Deep clustering for unsupervised learning of visual features. In: ECCV

  50. Radford A, Narasimhan K, Salimans T, Sutskever I (2018) Improving language understanding by generative pre-training

  51. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I (2019) Language models are unsupervised multitask learners. OpenAI blog

  52. Caruana R (1997) Multitask learning. Mach Learn 28:41–75

  53. Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. arXiv

  54. Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, Krikun M, Cao Y, Gao Q, Macherey K et al (2016) Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv

  55. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) Roberta: A robustly optimized BERT pretraining approach. CoRR arXiv:1907.11692

  56. Sennrich R, Haddow B, Birch A (2015) Neural machine translation of rare words with subword units. arXiv

  57. Song K, Tan X, Qin T, Lu J, Liu T (2020) Mpnet: masked and permuted pre-training for language understanding. In: NeurIPS

  58. Li Q, Peng H, Li J, Xia C, Yang R, Sun L, Yu PS, He L (2022) A survey on text classification: from traditional to deep learning. ACM Trans Intell Syst Technol (TIST) 13(2):1–41

    Google Scholar 

  59. Song K, Tan X, Qin T, Lu J, Liu T-Y (2019) Mass: masked sequence to sequence pre-training for language generation. arXiv

  60. Dong L, Yang N, Wang W, Wei F, Liu X, Wang Y, Gao J, Zhou M, Hon H-W (2019) Unified language model pre-training for natural language understanding and generation. arXiv

  61. Sun Y, Wang S, Li Y, Feng S, Chen X, Zhang H, Tian X, Zhu D, Tian H, Wu H (2019) Ernie: enhanced representation through knowledge integration. arXiv

  62. Sun Y, Wang S, Li Y, Feng S, Tian H, Wu H, Wang H (2020) Ernie 2.0: a continual pre-training framework for language understanding. In: AAAI

  63. Cui Y, Che W, Liu T, Qin B, Yang Z (2021) Pre-training with whole word masking for Chinese BERT. T-ASL. https://doi.org/10.1109/TASLP.2021.3124365

    Article  Google Scholar 

  64. Diao S, Bai J, Song Y, Zhang T, Wang Y (2020) ZEN: pre-training chinese text encoder enhanced by n-gram representations. In: EMNLP. https://doi.org/10.18653/v1/2020.findings-emnlp.425

  65. Tsai H, Riesa J, Johnson M, Arivazhagan N, Li X, Archer A (2019) Small and practical bert models for sequence labeling. arXiv

  66. Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347

  67. Thoppilan R, De Freitas D, Hall J, Shazeer N, Kulshreshtha A, Cheng H-T, Jin A, Bos T, Baker L, Du Y et al (2022) Lamda: language models for dialog applications. arXiv preprint arXiv:2201.08239

  68. Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013) Distributed representations of words and phrases and their compositionality. arXiv

  69. Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: EMNLP

  70. Dai AM, Le QV (2015) Semi-supervised sequence learning. arXiv

  71. Liu P, Qiu X, Huang X (2016) Recurrent neural network for text classification with multi-task learning. arXiv

  72. Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. TACL

  73. McCann B, Bradbury J, Xiong C, Socher R (2017) Learned in translation: Contextualized word vectors. arXiv

  74. Dai Z, Yang Z, Yang Y, Carbonell J, Le QV, Salakhutdinov R (2019) Transformer-xl: attentive language models beyond a fixed-length context. arXiv

  75. Kong L, d’Autume CdM, Ling W, Yu L, Dai Z, Yogatama D (2019) A mutual information maximization perspective of language representation learning. arXiv

  76. Wang W, Bi B, Yan M, Wu C, Bao Z, Xia J, Peng L, Si L (2019) Structbert: incorporating language structures into pre-training for deep language understanding. arXiv

  77. Xiong W, Du J, Wang WY, Stoyanov V (2019) Pretrained encyclopedia: weakly supervised knowledge-pretrained language model. arXiv

  78. Peters ME, Neumann M, Logan IV RL, Schwartz R, Joshi V, Singh S, Smith NA (2019) Knowledge enhanced contextual word representations. arXiv

  79. Huang H, Liang Y, Duan N, Gong M, Shou L, Jiang D, Zhou M (2019) Unicoder: a universal language encoder by pre-training with multiple cross-lingual tasks. arXiv

  80. Eisenschlos JM, Ruder S, Czapla P, Kardas M, Gugger S, Howard J (2019) Multifit: efficient multi-lingual language model fine-tuning. arXiv

  81. Beltagy I, Lo K, Cohan A (2019) Scibert: a pretrained language model for scientific text. arXiv

  82. Sun S, Cheng Y, Gan Z, Liu J (2019) Patient knowledge distillation for bert model compression. arXiv

  83. Lample G, Conneau A (2019) Cross-lingual language model pretraining. arXiv

  84. Zafrir O, Boudoukh G, Izsak P, Wasserblat M (2019) Q8bert: quantized 8bit bert. In: 2019 Fifth workshop on energy efficient machine learning and cognitive computing-NeurIPS edition (EMC2-NIPS). IEEE, pp 36–39

  85. Sanh V, Debut L, Chaumond J, Wolf T (2019) Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv

  86. Liu W, Zhou P, Zhao Z, Wang Z, Deng H, Ju Q (2020) Fastbert: a self-distilling bert with adaptive inference time. arXiv

  87. Martin L, Müller B, Suárez PJO, Dupont Y, Romary L, Clergerie É, Seddah D, Sagot B (2020) Camembert: a tasty french language model. In: ACL. https://doi.org/10.18653/v1/2020.acl-main.645

  88. Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F, Grave E, Ott M, Zettlemoyer L, Stoyanov V (2020) Unsupervised cross-lingual representation learning at scale. In: ACL. https://doi.org/10.18653/v1/2020.acl-main.747

  89. Kitaev N, Kaiser L, Levskaya A (2020) Reformer: the efficient transformer. In: ICLR

  90. Shen S, Dong Z, Ye J, Ma L, Yao Z, Gholami A, Mahoney MW, Keutzer K (2020) Q-bert: Hessian based ultra low precision quantization of bert. In: AAAI

  91. Chi Z, Dong L, Wei F, Wang W, Mao X-L, Huang H (2020) Cross-lingual natural language generation via pre-training. In: AAAI

  92. Liu W, Zhou P, Zhao Z, Wang Z, Ju Q, Deng H, Wang P (2020) K-bert: enabling language representation with knowledge graph. In: AAAI

  93. Jiang Z, Yu W, Zhou D, Chen Y, Feng J, Yan S (2020) Convbert: improving BERT with span-based dynamic convolution. In: NeurIPS

  94. Wang W, Wei F, Dong L, Bao H, Yang N, Zhou M (2020) Minilm: deep self-attention distillation for task-agnostic compression of pre-trained transformers. In: NeurIPS

  95. Liu Y, Gu J, Goyal N, Li X, Edunov S, Ghazvininejad M, Lewis M, Zettlemoyer L (2020) Multilingual denoising pre-training for neural machine translation. Trans Assoc Comput Linguist 8:726–742

  96. Sun T, Shao Y, Qiu X, Guo Q, Hu Y, Huang X, Zhang Z (2020) Colake: contextualized language and knowledge embedding. In: COLING. https://doi.org/10.18653/v1/2020.coling-main.327

  97. Le H, Vial L, Frej J, Segonne V, Coavoux M, Lecouteux B, Allauzen A, Crabbé B, Besacier L, Schwab D (2020) Flaubert: unsupervised language model pre-training for French. In: LREC

  98. Shen T, Mao Y, He P, Long G, Trischler A, Chen W (2020) Exploiting structured knowledge in text via graph-guided representation learning. In: EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.722

  99. Jiao X, Yin Y, Shang L, Jiang X, Chen X, Li L, Wang F, Liu Q (2020) Tinybert: distilling BERT for natural language understanding. In: EMNLP. https://doi.org/10.18653/v1/2020.findings-emnlp.372

  100. Delobelle P, Winters T, Berendt B (2020) Robbert: a dutch roberta-based language model. In: EMNLP. https://doi.org/10.18653/v1/2020.findings-emnlp.292

  101. He B, Zhou D, Xiao J, Jiang X, Liu Q, Yuan NJ, Xu T (2020) Integrating graph contextualized knowledge into pre-trained language models. In: EMNLP. https://doi.org/10.18653/v1/2020.findings-emnlp.207

  102. Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y, Li W, Liu PJ (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res 21(140):1–67

  103. Schick T, Schütze H (2021) Exploiting cloze-questions for few-shot text classification and natural language inference. In: EACL

  104. Wang X, Gao T, Zhu Z, Zhang Z, Liu Z, Li J, Tang J (2021) KEPLER: a unified model for knowledge embedding and pre-trained language representation. Trans Assoc Comput Linguist 9:176–194

  105. Gao T, Yao X, Chen D (2021) Simcse: simple contrastive learning of sentence embeddings. CoRR arXiv:2104.08821

  106. Du N, Huang Y, Dai AM, Tong S, Lepikhin D, Xu Y, Krikun M, Zhou Y, Yu AW, Firat O et al (2022) Glam: efficient scaling of language models with mixture-of-experts. In: International conference on machine learning. PMLR, pp 5547–5569

  107. Chi Z, Huang S, Dong L, Ma S, Singhal S, Bajaj P, Song X, Wei F (2021) Xlm-e: cross-lingual language model pre-training via electra. arXiv preprint arXiv:2106.16138

  108. Sanh V, Webson A, Raffel C, Bach SH, Sutawika L, Alyafeai Z, Chaffin A, Stiegler A, Scao TL, Raja A et al (2021) Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207

  109. Rae JW, Borgeaud S, Cai T, Millican K, Hoffmann J, Song F, Aslanides J, Henderson S, Ring R, Young S et al (2021) Scaling language models: methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446

  110. Smith S, Patwary M, Norick B, LeGresley P, Rajbhandari S, Casper J, Liu Z, Prabhumoye S, Zerveas G, Korthikanti V et al (2022) Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990

  111. Hoffmann J, Borgeaud S, Mensch A, Buchatskaya E, Cai T, Rutherford E, Casas DL, Hendricks LA, Welbl J, Clark A et al (2022) Training compute-optimal large language models. arXiv preprint arXiv:2203.15556

  112. Zhang S, Roller S, Goyal N, Artetxe M, Chen M, Chen S, Dewan C, Diab M, Li X, Lin XV et al (2022) Opt: open pre-trained transformer language models. arXiv preprint arXiv:2205.01068

  113. Wei J, Bosma M, Zhao V, Guu K, Yu AW, Lester B, Du N, Dai AM, Le QV (2022) Finetuned language models are zero-shot learners. In: International conference on learning representations

  114. Honovich O, Scialom T, Levy O, Schick T (2022) Unnatural instructions: tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689

  115. Wang Y, Mishra S, Alipoormolabashi P, Kordi Y, Mirzaei A, Naik A, Ashok A, Dhanasekaran AS, Arunkumar A, Stap D et al (2022) Super-natural instructions: generalization via declarative instructions on 1600+ nlp tasks. In: Proceedings of the 2022 conference on empirical methods in natural language processing, pp 5085–5109

  116. Mishra S, Khashabi D, Baral C, Hajishirz, H (2022) Cross-task generalization via natural language crowdsourcing instructions. In: Proceedings of the 60th Annual meeting of the association for computational linguistics (volume 1: long papers), pp 3470–3487

  117. Wang Y, Kordi Y, Mishra S, Liu A, Smith NA, Khashabi D, Hajishirzi H (2022) Self-instruct: aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560

  118. Weidinger L, Mellor J, Rauh M, Griffin C, Uesato J, Huang P-S, Cheng M, Glaese M, Balle B, Kasirzadeh A et al (2021) Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359

  119. Kiegeland S, Kreutzer J (2021) Revisiting the weaknesses of reinforcement learning for neural machine translation. In: Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: human language technologies, pp 1673–1681

  120. Jaques N, Shen JH, Ghandeharioun A, Ferguson C, Lapedriza A, Jones N, Gu S, Picard R (2020) Human-centric dialog training via offline reinforcement learning. In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pp 3985–4003

  121. Rennie SJ, Marcheret E, Mroueh Y, Ross J, Goel V (2017) Self-critical sequence training for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7008–7024

  122. Pang RY, He H (2021) Text generation by learning from demonstrations. In: Proceedings of the international conference on learning representations

  123. Hausknecht M, Ammanabrolu P, Côté M-A, Yuan X (2020) Interactive fiction games: a colossal adventure. Proc AAAI Conf Artif Intell 34:7903–7910

    Google Scholar 

  124. Snell C, Kostrikov I, Su Y, Yang M, Levine S (2022) Offline rl for natural language generation with implicit language q learning. arXiv preprint arXiv:2206.11871

  125. Lu X, Welleck S, Jiang L, Hessel J, Qin L, West P, Ammanabrolu P, Choi Y (2022) Quark: controllable text generation with reinforced unlearning. arXiv preprint arXiv:2205.13636

  126. Uc-Cetina V, Navarro-Guerrero N, Martin-Gonzalez A, Weber C, Wermter S (2022) Survey on reinforcement learning for language processing. Artif Intell Rev 56:1543–1575

    Article  Google Scholar 

  127. Ramamurthy R, Ammanabrolu P, Brantley K, Hessel J, Sifa R, Bauckhage C, Hajishirzi H, Choi Y (2022) Is reinforcement learning (not) for natural language processing?: benchmarks, baselines, and building blocks for natural language policy optimization. arXiv preprint arXiv:2210.01241

  128. Wu J, Ouyang L, Ziegler DM, Stiennon N, Lowe R, Leike J, Christiano P (2021) Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862

  129. Nakano R, Hilton J, Balaji S, Wu J, Ouyang L, Kim C, Hesse C, Jain S, Kosaraju V, Saunders W et al (2021) Webgpt: browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332

  130. Glaese A, McAleese N, Trębacz M, Aslanides J, Firoiu V, Ewalds T, Rauh M, Weidinger L, Chadwick M, Thacker P et al (2022) Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375

  131. Bai Y, Kadavath S, Kundu S, Askell A, Kernion J, Jones A, Chen A, Goldie A, Mirhoseini A, McKinnon C et al (2022) Constitutional AI: harmlessness from AI feedback. arXiv preprint arXiv:2212.08073

  132. Chung HW, Hou L, Longpre S, Zoph B, Tay Y, Fedus W, Li E, Wang X, Dehghani M, Brahma S et al (2022) Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416

  133. Kojima T, Gu SS, Reid M, Matsuo Y, Iwasawa Y (2022) Large language models are zero-shot reasoners. Adv Neu Inf Process Syst 35:22199–22213

  134. Dosovitskiy A, Springenberg JT, Riedmiller M, Brox T (2014) Discriminative unsupervised feature learning with convolutional neural networks. Adv Neu Inf Process Syst

  135. Dosovitskiy A, Fischer P, Springenberg JT, Riedmiller M, Brox T (2016) Discriminative unsupervised feature learning with exemplar convolutional neural networks. TPAMI. TPAMI-2015-05-0348.R1

  136. Doersch C, Gupta A, Efros AA (2015) Unsupervised visual representation learning by context prediction. In: ICCV

  137. Pathak D, Krahenbuhl P, Donahue J, Darrell T, Efros AA (2016) Context encoders: feature learning by inpainting. In: CVPR

  138. Zhang R, Isola P, Efros AA (2016) Colorful image colorization. In: ECCV

  139. Zhang R, Isola P, Efros AA (2017) Split-brain autoencoders: unsupervised learning by cross-channel prediction. In: CVPR

  140. Noroozi M, Favaro P (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV

  141. Kim D, Cho D, Yoo D, Kweon IS (2018) Learning image representations by completing damaged jigsaw puzzles. In: WACV

  142. Noroozi M, Pirsiavash H, Favaro P (2017) Representation learning by learning to count. In: ICCV

  143. Bojanowski P, Joulin A (2017) Unsupervised learning by predicting noise. In: ICML

  144. Gidaris S, Singh P, Komodakis N (2018) Unsupervised representation learning by predicting image rotations. arXiv

  145. Oord Avd, Li Y, Vinyals O (2018) Representation learning with contrastive predictive coding. arXiv

  146. Henaff O (2020) Data-efficient image recognition with contrastive predictive coding. In: ICML

  147. Donahue J, Simonyan K (2019) Large scale adversarial representation learning. In: NeurIPS, pp 10541–10551

  148. Dumoulin V, Belghazi I, Poole B, Mastropietro O, Lamb A, Arjovsky M, Courville A (2016) Adversarially learned inference. arXiv

  149. Chen M, Radford A, Child R, Wu J, Jun H, Luan D, Sutskever I (2020) Generative pretraining from pixels. In: ICML

  150. Bao H, Dong L, Piao S, Wei F (2021) Beit: Bert pre-training of image transformers. In: International conference on learning representations

  151. Caron M, Touvron H, Misra I, Jégou H, Mairal J, Bojanowski P, Joulin A (2021) Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9650–9660

  152. Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, Chen M, Sutskever I (2021) Zero-shot text-to-image generation. In: International conference on machine learning. PMLR, pp 8821–8831

  153. He K, Chen X, Xi, S, Li Y, Dollár P, Girshick R (2022) Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 16000–16009

  154. Xie Z, Zhang Z, Cao Y, Lin Y, Bao J, Yao Z, Dai Q, Hu H (2022) Simmim: a simple framework for masked image modeling. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9653–9663

  155. Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, Xiao T, Whitehead S, Berg AC, Lo W-Y et al (2023) Segment anything. arXiv preprint arXiv:2304.02643

  156. Wu Z, Xiong Y, Yu SX, Lin D (2018) Unsupervised feature learning via non-parametric instance discrimination. In: CVPR

  157. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10012–10022

  158. Li X, Wang W, Yang L, Yang J (2022) Uniform masking: enabling mae pre-training for pyramid-based vision transformers with locality. arXiv preprint arXiv:2205.10063

  159. Chen J, Hu M, Li B, Elhoseiny M (2022) Efficient self-supervised vision pretraining with local masked reconstruction. arXiv preprint arXiv:2206.00790

  160. Hinton G, Vinyals O, Dean J (2015) Distilling the knowledge in a neural network. arXiv

  161. Zhuang C, Zhai AL, Yamins D (2019) Local aggregation for unsupervised learning of visual embeddings. In: ICCV

  162. Misra I, Maaten L (2020) Self-supervised learning of pretext-invariant representations. In: CVPR

  163. He K, Fan H, Wu Y, Xie S, Girshick R (2020) Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9729–9738

  164. Chen X, Fan H, Girshick R, He K (2020) Improved baselines with momentum contrastive learning. arXiv

  165. Grill J-B, Strub F, Altché F, Tallec C, Richemond PH, Buchatskaya E, Doersch C, Pires BA, Guo ZD, Azar MG et al (2020) Bootstrap your own latent: a new approach to self-supervised learning. arXiv

  166. Chen T, Kornblith S, Norouzi M, Hinton G (2020) A simple framework for contrastive learning of visual representations. In: ICML

  167. Caron M, Misra I, Mairal J, Goyal P, Bojanowski P, Joulin A (2020) Unsupervised learning of visual features by contrasting cluster assignments. arXiv

  168. Goyal P, Caron M, Lefaudeux B, Xu M, Wang P, Pai V, Singh M, Liptchinsky V, Misra I, Joulin A et al (2021) Self-supervised pretraining of visual features in the wild. arXiv

  169. Radosavovic I, Kosaraju RP, Girshick R, He K, Dollár P (2020) Designing network design spaces. In: CVPR

  170. Chen X, He K (2021) Exploring simple siamese representation learning. In: CVPR

  171. Li J, Zhou P, Xiong C, Hoi SCH (2021) Prototypical contrastive learning of unsupervised representations. In: ICLR. OpenReview.net

  172. Zhang L, Qi G-J, Wang L, Luo J (2019) Aet vs. aed: unsupervised representation learning by auto-encoding transformations rather than data. In: CVPR

  173. Bachman P, Hjelm RD, Buchwalter W (2019) Learning representations by maximizing mutual information across views. arXiv

  174. Yan X, Misra I, Gupta A, Ghadiyaram D, Mahajan D (2020) Clusterfit: improving generalization of visual representations. In: CVPR

  175. Asano YM, Rupprecht C, Vedaldi A (2019) Self-labelling via simultaneous clustering and representation learning. arXiv preprint arXiv:1911.05371

  176. Chen T, Kornblith S, Swersky K, Norouzi M, Hinton G (2020) Big self-supervised models are strong semi-supervised learners. arXiv

  177. Tian Y, Krishnan D, Isola P (2019) Contrastive multiview coding. arXiv

  178. Cubuk ED, Zoph B, Shlens J, Le QV (2019) Randaugment: practical data augmentation with no separate search. arXiv

  179. Tian Y, Sun C, Poole B, Krishnan D, Schmid C, Isola P (2020) What makes for good views for contrastive learning. arXiv

  180. Chen X, Xie S, He K (2021) An empirical study of training self-supervised vision transformers. arXiv

  181. Mitrovic J, McWilliams B, Walker JC, Buesing LH, Blundell C (2021) Representation learning via invariant causal mechanisms. In: ICLR

  182. Tian Y, Chen X, Ganguli S (2021) Understanding self-supervised learning dynamics without contrastive pairs. In: ICML. Proceedings of machine learning research

  183. Xie Z, Lin Y, Yao Z, Zhang Z, Dai Q, Cao Y, Hu H (2021) Self-supervised learning with swin transformers. arXiv

  184. Li Z, Chen Z, Yang F, Li W, Zhu Y, Zhao C, Deng R, Wu L, Zhao R, Tang M et al (2021) Mst: masked self-supervised transformer for visual representation. Adv Neural Inf Process Syst 34

  185. Bao H, Dong L, Piao S, Wei F (2022) BEit: BERT pre-training of image transformers. In: International conference on learning representations. https://openreview.net/forum?id=p-BhZSz59o4

  186. Chen X, Ding M, Wang X, Xin Y, Mo S, Wang Y, Han S, Luo P, Zeng G, Wang J (2022) Context autoencoder for self-supervised representation learning. arXiv preprint arXiv:2202.03026

  187. Dong X, Bao J, Zhang T, Chen D, Zhang W, Yuan L, Chen D, Wen F, Yu N (2021) Peco: perceptual codebook for bert pre-training of vision transformers. arXiv preprint arXiv:2111.12710

  188. You Y, Chen T, Wang Z, Shen Y (2020) When does self-supervision help graph convolutional networks? In: ICML. Proceedings of machine learning research, pp 10871–10880

  189. Jin W, Derr T, Liu H, Wang Y, Wang S, Liu Z, Tang J (2020) Self-supervised learning on graphs: deep insights and new direction. CoRR arXiv:2006.10141

  190. Hu W, Liu B, Gomes J, Zitnik M, Liang P, Pande VS, Leskovec J (2020) Strategies for pre-training graph neural networks. In: ICLR

  191. Perozzi B, Al-Rfou R, Skien S (2014) Deepwalk: online learning of social representations. In: ACM SIGKDD

  192. Grover A, Leskovec J (2016) node2vec: scalable feature learning for networks. In: ACM SIGKDD

  193. Tang J, Qu M, Wang M, Zhang M, Yan J, Mei Q (2015) LINE: large-scale information network embedding. In: WWW

  194. Kipf TN, Welling M (2016) Variational graph auto-encoders. CoRR

  195. Qiu J, Chen Q, Dong Y, Zhang J, Yang H, Ding M, Wang K, Tang J (2020) GCC: graph contrastive coding for graph neural network pre-training. In: KDD

  196. Zhu Y, Xu Y, Yu F, Liu Q, Wu S, Wang L (2021) Graph contrastive learning with adaptive augmentation. In: WWW

  197. You Y, Chen T, Sui Y, Chen T, Wang Z, Shen Y (2020) Graph contrastive learning with augmentations. In: NeurIPS

  198. Mavromatis C, Karypis G (2021) Graph infoclust: maximizing coarse-grain mutual information in graphs. In: PAKDD

  199. Sun Q, Li J, Peng H, Wu J, Ning Y, Yu PS, He L (2021) SUGAR: subgraph neural network with reinforcement pooling and self-supervised mutual information mechanism. In: WWW

  200. Velickovic P, Fedus W, Hamilton WL, Liò P, Bengio Y, Hjelm RD (2019) Deep graph infomax. In: ICLR

  201. Hassani K, Ahmadi AHK (2020) Contrastive multi-view representation learning on graphs. In: ICML. Proceedings of machine learning research, pp 4116–4126

  202. Jiao Y, Xiong Y, Zhang J, Zhang Y, Zhang T, Zhu Y (2020) Sub-graph contrast for scalable self-supervised graph representation learning. In: ICDM, pp 222–231

  203. Sun K, Lin Z, Zhu Z (2020) Multi-stage self-supervised learning for graph convolutional networks on graphs with few labeled nodes. In: AAAI, pp 5892–5899

  204. Rong Y, Bian Y, Xu T, Xie W, Wei Y, Huang W, Huang J (2020) Self-supervised graph transformer on large-scale molecular data. In: NeurIPS

  205. Tan Q, Liu N, Huang X, Chen R, Choi S-H, Hu X (2022) Mgae: Masked autoencoders for self-supervised learning on graphs. arXiv preprint arXiv:2201.02534

  206. Hou Z, Liu X, Dong Y, Wang C, Tang J et al (2022) Graphmae: Self-supervised masked graph autoencoders. arXiv preprint arXiv:2205.10803

  207. Li J, Wu R, Sun W, Chen L, Tian S, Zhu L, Meng C, Zheng Z, Wang W (2022) Maskgae: masked graph modeling meets graph autoencoders. arXiv preprint arXiv:2205.10053

  208. Tian Y, Dong K, Zhang C, Zhang C, Chawla NV (2022) Heterogeneous graph masked autoencoders. arXiv preprint arXiv:2208.09957

  209. Wan S, Pan S, Yang J, Gong C (2021) Contrastive and generative graph convolutional networks for graph-based semi-supervised learning. In: AAAI

  210. Zhang J, Zhang H, Xia C, Sun L (2020) Graph-bert: only attention is needed for learning graph representations. arXiv arXiv:2001.05140

  211. Peng Z, Huang W, Luo M, Zheng Q, Rong Y, Xu T, Huang J (2020) Graph representation learning via graphical mutual information maximization. In: WWW

  212. Hu Z, Dong Y, Wang K, Chang K, Sun Y (2020) GPT-GNN: generative pre-training of graph neural networks. In: KDD

  213. Wang G, Guo H, Li A, Liu X, Yan Q (2023) Federated iot interaction vulnerability analysis. In: 2023 IEEE 39th international conference on data engineering (ICDE). IEEE

  214. Hamilton WL, Ying Z, Leskovec J (2017) Inductive representation learning on large graphs. In: NIPS

  215. Hwang D, Park J, Kwon S, Kim K, Ha J, Kim HJ (2020) Self-supervised auxiliary learning with meta-paths for heterogeneous graphs. In: NeurIPS

  216. Sun F, Hoffmann J, Verma V, Tang J (2020) Infograph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. In: ICLR

  217. Park C, Kim D, Han J, Yu H (2020) Unsupervised attributed multiplex network embedding. In: AAAI, pp 5371–5378

  218. You Y, Chen T, Shen Y, Wang Z (2021) Graph contrastive learning automated. CoRR arXiv:2106.07594

  219. Zeng J, Xie P (2021) Contrastive self-supervised learning for graph classification. In: AAAI, pp 10824–10832

  220. Xu M, Wang H, Ni B, Guo H, Tang J (2021) Self-supervised graph-level representation learning with local and global structure. CoRR arXiv:2106.04113

  221. Wang P, Agarwal K, Ham C, Choudhury S, Reddy CK (2021) Self-supervised learning of contextual embeddings for link prediction in heterogeneous networks. In: WWW

  222. Cao J, Lin X, Guo S, Liu L, Liu T, Wang B (2021) Bipartite graph embedding via mutual information maximization. In: WSDM

  223. Wang X, Liu N, Han H, Shi C (2021) Self-supervised heterogeneous graph neural network with co-contrastive learning. KDD arXiv:2105.09111

  224. Kim D, Oh A (2021) How to find your friendly neighborhood: graph attention design with self-supervision. In: ICLR. https://openreview.net/forum?id=Wi5KUNlqWty

  225. Sun M, Xing J, Wang H, Chen B, Zhou J (2021) Mocl: contrastive learning on molecular graphs with multi-level domain knowledge. CoRR arXiv:2106.04509

  226. Schneider S, Baevski A, Collobert R, Auli M (2019) wav2vec: unsupervised pre-training for speech recognition. In: INTERSPEECH

  227. Baevski A, Schneider S, Auli M (2020) vq-wav2vec: self-supervised learning of discrete speech representations. In: ICLR

  228. Baevski A, Zhou Y, Mohamed A, Auli M (2020) wav2vec 2.0: a framework for self-supervised learning of speech representations. In: NeurIPS

  229. Chung Y, Glass JR (2020) Generative pre-training for speech with autoregressive predictive coding. In: ICASSP

  230. Song X, Wang G, Huang Y, Wu Z, Su D, Meng H (2020) Speech-xlnet: unsupervised acoustic model pretraining for self-attention networks. In: INTERSPEECH

  231. Chung Y, Wang Y, Hsu W, Zhang Y, Skerry-Ryan RJ (2019) Semi-supervised training for improving data efficiency in end-to-end speech synthesis. In: ICASSP

  232. Denisov P, Vu NT (2020) Pretrained semantic speech embeddings for end-to-end spoken language understanding via cross-modal teacher-student learning. In: Interspeech

  233. Chung Y-A, Zhu C, Zeng M (2021) SPLAT: speech-language joint pre-training for spoken language understanding. In: ACL

  234. Zeng M, Tan X, Wang R, Ju Z, Qin T, Liu T-Y (2021) Musicbert: symbolic music understanding with large-scale pre-training. arXiv preprint arXiv:2106.05630

  235. Huang Y-S, Yang Y-H (2020) Pop music transformer: Beat-based modeling and generation of expressive pop piano compositions. In: Proceedings of the 28th ACM international conference on multimedia, pp 1180–1188

  236. Verma P, Berger J (2021) Audio transformers: Transformer architectures for large scale audio understanding. adieu convolutions. arXiv preprint arXiv:2105.00335

  237. Fernando B, Bilen H, Gavves E, Gould S (2017) Self-supervised video representation learning with odd-one-out networks. In: CVPR

  238. Misra I, Zitnick CL, Hebert M (2016) Shuffle and learn: unsupervised learning using temporal order verification. In: ECCV

  239. Kim D, Cho D, Kweon IS (2019) Self-supervised video representation learning with space-time cubic puzzles. In: AAAI

  240. Tao L, Wang X, Yamasaki T (2020) Self-supervised video representation learning using inter-intra contrastive framework. In: ACM multimedia

  241. Lorre G, Rabarisoa J, Orcesi A, Ainouz S, Canu S (2020) Temporal contrastive pretraining for video action recognition. In: WACV

  242. Yao T, Zhang Y, Qiu Z, Pan Y, Mei T (2020) Seco: Exploring sequence supervision for unsupervised representation learning. arXiv

  243. Li LH, Yatskar M, Yin D, Hsieh C, Chang K (2019) Visualbert: a simple and performant baseline for vision and language. CoRR arXiv:1908.03557

  244. Li G, Duan N, Fang Y, Gong M, Jiang D (2020) Unicoder-vl: a universal encoder for vision and language by cross-modal pre-training. In: AAAI

  245. Su W, Zhu X, Cao Y, Li B, Lu L, Wei F, Dai J (2020) VL-BERT: pre-training of generic visual-linguistic representations. In: ICLR

  246. Lu J, Batra D, Parikh D, Lee S (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: NeurIPS

  247. Ramesh A, Dhariwal P, Nichol A, Chu C, Chen M (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125

  248. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, et al. (2021) Learning transferable visual models from natural language supervision. In: International conference on machine learning. PMLR, pp 8748–8763

  249. Nichol A, Dhariwal P, Ramesh A, Shyam P, Mishkin P, McGrew B, Sutskever I, Chen M (2021) Glide: towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741

  250. Sayed N, Brattoli B, Ommer B (2018) Cross and learn: cross-modal self-supervision. In: GCPR

  251. Ren Z, Lee YJ (2018) Cross-domain self-supervised multi-task feature learning using synthetic imagery. In: CVPR

  252. Tian Y, Krishnan D, Isola P (2020) Contrastive multiview coding. In: ECCV

  253. Zlotchevski A, Drain D, Svyatkovskiy A, Clement CB, Sundaresan N, Tufano M (2022) Exploring and evaluating personalized models for code generation. In: Proceedings of the 30th ACM joint European software engineering conference and symposium on the foundations of software engineering, pp 1500–1508

  254. Thakur, S., Ahmad B, Fan Z, Pearce H, Tan B, Karri R, Dolan-Gavitt B, Garg S (2022) Benchmarking large language models for automated verilog rtl code generation. arXiv preprint arXiv:2212.11140

  255. Nijkamp E, Pang B, Hayashi H, Tu L, Wang H, Zhou Y, Savarese S, Xiong C (2022) Codegen: an open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474

  256. Poesia G, Polozov O, Le V, Tiwari A, Soares G, Meek C, Gulwani S (2022) Synchromesh: reliable code generation from pre-trained language models. arXiv preprint arXiv:2201.11227

  257. Chen Y-C, Li L, Yu L, El Kholy A, Ahmed F, Gan Z, Cheng Y, Liu J (2020) Uniter: universal image-text representation learning. In: European conference on computer vision. Springer, pp 104–120

  258. Zhu X, Zhu J, Li H, Wu X, Li H, Wang X, Dai J (2022) Uni-perceiver: pre-training unified architecture for generic perception for zero-shot and few-shot tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 16804–16815

  259. Reed S, Zolna K, Parisotto E, Colmenarejo SG, Novikov A, Barth-Maron G, Gimenez M, Sulsky Y, Kay J, Springenberg JT et al (2022) A generalist agent. arXiv preprint arXiv:2205.06175

  260. Li W, Gao C, Niu G, Xiao X, Liu H, Liu J, Wu H, Wang H (2020) Unimo: towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409

  261. Stadie BC, Levine S, Abbeel P (2015) Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814

  262. Achiam J, Sastry S (2017) Surprise-based intrinsic motivation for deep reinforcement learning. arXiv preprint arXiv:1703.01732

  263. Pathak D, Agrawal P, Efros AA, Darrell T (2017) Curiosity-driven exploration by self-supervised prediction. In: International conference on machine learning. PMLR, pp 2778–2787

  264. Tang H, Houthooft R, Foote D, Stooke A, Xi Chen O, Duan Y, Schulman J, DeTurck F, Abbeel P (2017) # Exploration: a study of count-based exploration for deep reinforcement learning. Adv Neural Inf Process Syst 30

  265. Dey P, Medya S (2019) Manipulating node similarity measures in network. arXiv

  266. Han B, Zheng C, Chan H, Paster K, Zhang M, Ba J (2021) Learning domain invariant representations in goal-conditioned block mdps. Adv Neural Inf Process Syst 34:764–776

    Google Scholar 

  267. Ding Y, Florensa C, Abbeel P, Phielipp M (2019) Goal-conditioned imitation learning. Adv Neural Inf Process Syst 32

  268. Shah R, Kumar V (2021) Rrl: Resnet as representation for reinforcement learning. arXiv preprint arXiv:2107.03380

  269. Xiao T, Radosavovic I, Darrell T, Malik J (2022) Masked visual pre-training for motor control. arXiv preprint arXiv:2203.06173

  270. Schwarzer M, Rajkumar N, Noukhovitch M, Anand A, Charlin L, Hjelm RD, Bachman P, Courville AC (2021) Pretraining representations for data-efficient reinforcement learning. Adv Neural Inf Process Syst 34:12686–12699

    Google Scholar 

  271. Schwarzer M, Anand A, Goel R, Hjelm RD, Courville A, Bachman P (2020) Data-efficient reinforcement learning with self-predictive representations. arXiv preprint arXiv:2007.05929

  272. Ha D, Schmidhuber J (2018) World Models https://doi.org/10.5281/zenodo.1207631arXiv:1803.10122 [cs, stat]

  273. Jaderberg M, Mnih V, Czarnecki WM, Schaul T, Leibo JZ, Silver D, Kavukcuoglu K (2016) Reinforcement learning with unsupervised auxiliary tasks. arXiv. https://doi.org/10.48550/arXiv.1611.05397

  274. Higgins I, Pal A, Rusu AA, Matthey L, Burgess CP, Pritzel A, Botvinick M, Blundell C, Lerchner A (2018) DARLA: improving zero-shot transfer in reinforcement learning. arXiv. https://doi.org/10.48550/arXiv.1707.08475

  275. Finn C, Yu T, Fu J, Abbeel P, Levine S (2016) Generalizing skills with semi-supervised reinforcement learning. arXiv preprint arXiv:1612.00429

  276. Shah R, Kumar V (2021) RRL: Resnet as representation for reinforcement learning. arXiv

  277. Schwarzer M, Anand A, Goel R, Hjelm RD, Courville A, Bachman P (2021) Data-efficient reinforcement learning with self-predictive representations. arXiv. https://doi.org/10.48550/arXiv.2007.05929

  278. Hafner D, Lillicrap T, Ba J, Norouzi M (2019) Dream to control: learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603

  279. Hafner D, Lillicrap T, Norouzi M, Ba J (2020) Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193

  280. Deng F, Jang I, Ah, S (2022) Dreamerpro: reconstruction-free model-based reinforcement learning with prototypical representations. In: International conference on machine learning. PMLR, pp 4956–4975

  281. Wu P, Escontrela A, Hafner D, Goldberg K, Abbeel P (2022) Daydreamer: world models for physical robot learning. arXiv preprint arXiv:2206.14176

  282. Laskin M, Srinivas A, Abbeel P (2020) Curl: contrastive unsupervised representations for reinforcement learning. In: International conference on machine learning. PMLR, pp 5639–5650

  283. Laskin M, Lee K, Stooke A, Pinto L, Abbeel P, Srinivas A (2020) Reinforcement learning with augmented data. Adv Neural Inf Process Syst 33:19884–19895

    Google Scholar 

  284. Kostrikov I, Yarats D, Fergus R (2020) Image augmentation is all you need: regularizing deep reinforcement learning from pixels. arXiv preprint arXiv:2004.13649

  285. Yarats D, Fergus R, Lazaric A, Pinto L (2021) Mastering visual continuous control: improved data-augmented reinforcement learning. arXiv preprint arXiv:2107.09645

  286. Nair S, Rajeswaran A, Kumar V, Finn C, Gupta A (2022) R3m: a universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601

  287. Parisi S, Rajeswaran A, Purushwalkam S, Gupta A (2022) The unsurprising effectiveness of pre-trained vision models for control. arXiv preprint arXiv:2203.03580

  288. Zhou C, Yan Q, Shi Y, Sun L (2022) \(\{\)DoubleStar\(\}\):\(\{\)Long-Range\(\}\) attack towards depth estimation based obstacle avoidance in autonomous systems. In: 31st USENIX security symposium (USENIX Security 22), pp 1885–1902

  289. Zhou C, Yan Q, Kent D, Wang G, Zhang Z, Radha H (2024) Optical lens attack on deep learning based monocular depth estimation. arXiv preprint arXiv:2409.17376

  290. Zhou C, Yan Q, Liu S (2024) Transient adversarial 3d projection attacks on object detection in autonomous driving. arXiv preprint arXiv:2409.17403

  291. Jin D, Jin Z, Zhou JT, Szolovits P (2020) Is bert really robust? A strong baseline for natural language attack on text classification and entailment. In: AAAI

  292. Zang Y, Qi F, Yang C, Liu Z, Zhang M, Liu Q, Sun M (2020) Word-level textual adversarial attacking as combinatorial optimization. In: ACL

  293. Wallace E, Feng S, Kandpal N, Gardner M, Singh S (2019) Universal adversarial triggers for attacking and analyzing NLP. In: EMNLP-IJCNLP

  294. Kurita K, Michel P, Neubig G (2020) Weight poisoning attacks on pretrained models. In: ACL

  295. Schuster R, Schuster T, Meri Y, Shmatikov V (2020) Humpty dumpty: controlling word meanings via corpus poisoning. In: IEEE symposium on security and privacy

  296. Bao R, Wang J, Zhao H (2021) Defending pre-trained language models from adversarial word substitution without performance sacrifice. In: ACL/IJCNLP

  297. Zhang Z, Li Y, Wang J, Liu B, Li D, Guo Y, Chen X, Liu Y (2022) Remos: reducing defect inheritance in transfer learning via relevant model slicing. In: Proceedings of the 44th international conference on software engineering, pp 1856–1868

  298. Carlini N, Tramer F, Wallace E, Jagielski M, Herbert-Voss A, Lee K, Roberts A, Brown TB, Song D, Erlingsson U et al (2021) Extracting training data from large language models. In: USENIX security symposium, vol 6

  299. Wang G, Zhang L, Yang Z, Li X-Y (2018) Socialite: social activity mining and friend auto-labeling. In: 2018 IEEE 37th international performance computing and communications conference (IPCCC). IEEE, pp 1–8

  300. Han F, Zhang L, You X, Wang G, Li X-Y (2019) Shad: privacy-friendly shared activity detection and data sharing. In: 2019 IEEE 16th international conference on mobile ad hoc and sensor systems (MASS). IEEE, pp 109–117

  301. Chen T, Zhai X, Ritter M, Lucic M, Houlsby N (2019) Self-supervised gans via auxiliary rotation loss. In: CVPR

  302. Abnar S, Dehghani M, Neyshabur B, Sedghi H (2021) Exploring the limits of large scale pre-training. arXiv

  303. Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. In: NAACL-HLT

  304. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: NIPS

  305. Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa PP (2011) Natural language processing (almost) from scratch. J Mach Learn Res 12:2493–2537

  306. Neelakantan A, Shankar J, Passos A, McCallum A (2014) Efficient non-parametric estimation of multiple embeddings per word in vector space. In: EMNLP. https://doi.org/10.3115/v1/d14-1113

  307. Zhou P, Qi Z, Zheng S, Xu J, Bao H, Xu B (2016) Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. In: COLING

  308. Hui DU, Xueke XU, Dayong WU, Liu Y, Zhihua YU, Cheng X (2017) A sentiment classification method based on sentiment-specific word embedding. J Chin Inf Process 31(3):170–176

  309. Liu Y, Ma C, Zhang Y (2017) Hierarchical machine translation model based on deep recursive neural network. Chin J Comput 40(4):861–871

  310. Liang X, Ren F, Liu Y, Pan L, Hou Y, Zhang Y, Yan LI (2018) N-reader: machine reading comprehension model based on double layers of self-attention. J Chin Inf Process

  311. Zhichang Z, Zhenwen Z, Zhiman Z (2019) User intent classification based on indrnn-attention. J Comput Res Dev

  312. Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition

  313. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Adv Neural Inf Proces Syst

  314. Lin M, Chen Q, Yan S (2013) Network in network. arXiv

  315. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv

  316. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: CVPR

  317. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: CVPR

  318. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: CVPR

  319. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR

  320. Girshick R (2015) Fast r-cnn. In: ICCV

  321. Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: towards real-time object detection with region proposal networks. arXiv

  322. He K, Gkioxari G, Dollár P, Girshick R (2017) Mask r-cnn. In: ICCV

  323. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: CVPR

  324. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y, Berg AC (2016) Ssd: single shot multibox detector. In: European conference on computer vision

  325. Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: CVPR

  326. Redmon J, Farhadi A (2017) Yolo9000: better, faster, stronger. In: CVPR

  327. Redmon J, Farhadi A (2018) Yolov3: An incremental improvement. arXiv

  328. Bochkovskiy A, Wang C-Y, Liao H-YM (2020) Yolov4: optimal speed and accuracy of object detection. arXiv

  329. Chen Q, Wang Y, Yang T, Zhang X, Cheng J, Sun J (2021) You only look one-level feature. arXiv

  330. Badrinarayanan V, Kendall A, Cipolla R (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell 39(12):2481–2495

  331. Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. In: CVPR

  332. Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2014) Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv

  333. Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2017) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848

  334. Chen L-C, Papandreou G, Schroff F, Adam H (2017) Rethinking atrous convolution for semantic image segmentation. arXiv

  335. Chen L-C, Zhu Y, Papandreou G, Schroff F, Adam H (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In: ECCV

  336. Lin G, Milan A, Shen C, Reid I (2017) Refinenet: multi-path refinement networks for high-resolution semantic segmentation. In: CVPR

  337. Everingham M, Eslami SA, Van Gool L, Williams CK, Winn J, Zisserman A (2015) The pascal visual object classes challenge: a retrospective. Int J Comput Vision 111:98-136

  338. Everingham M, Van Gool L, Williams CK, Winn J, Zisserman A (2010) The pascal visual object classes (voc) challenge. Int J Comput Vision 88:303–338

  339. Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: common objects in context. In: European Conference on Computer Vision

  340. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536

  341. Jordan MI (1997) Serial order: a parallel distributed processing approach. In: Advances in psychology, vol 121, pp 471–495. North-Holland

  342. Elman JL (1990) Finding structure in time. Cogn Sci 14(2):179–211

  343. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput MIT-Press

  344. Graves A, Liwicki M, Fernández S, Bertolami R, Bunke H, Schmidhuber J (2008) A novel connectionist system for unconstrained handwriting recognition. IEEE Trans Pattern Anal Mach Intell 31(5):855–868

  345. Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator. In: CVPR

  346. Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. arXiv

  347. Graves A (2013) Generating sequences with recurrent neural networks. arXiv

  348. Sundermeyer M, Schlüter R, Ney H (2012) Lstm neural networks for language modeling. In: INTERSPEECH

  349. Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial networks. arXiv

  350. Isola P, Zhu J-Y, Zhou T, Efros AA (2017) Image-to-image translation with conditional adversarial networks. In: CVPR

  351. Li C, Wand M (2016) Precomputed real-time texture synthesis with Markovian generative adversarial networks. In: European conference on computer vision

  352. Zhu J-Y, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV

  353. Karras T, Laine S, Aila T (2019) A style-based generator architecture for generative adversarial networks. In: CVPR

  354. Van Oord A, Kalchbrenner N, Kavukcuoglu K (2016) Pixel recurrent neural networks. In: International conference on machine learning

  355. Kim T, Cha M, Kim H, Lee JK, Kim J (2017) Learning to discover cross-domain relations with generative adversarial networks. In: International conference on machine learning

  356. Denton E, Chintala S, Szlam A, Fergus R (2015) Deep generative image models using a laplacian pyramid of adversarial networks. arXiv

  357. Huang X, Li Y, Poursaeed O, Hopcroft J, Belongie S (2017) Stacked generative adversarial networks. In: CVPR

  358. Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: CVPR

  359. Woo S, Park J, Lee J-Y, Kweon IS (2018) Cbam: convolutional block attention module. In: ECCV

  360. Cao Y, Xu J, Lin S, Wei F, Hu H (2019) Gcnet: non-local networks meet squeeze-excitation networks and beyond. In: ICCV workshops

  361. Huang Z, Wang X, Huang L, Huang C, Wei Y, Liu W (2019) Ccnet: Criss-cross attention for semantic segmentation. In: ICCV

  362. Zhang H, Goodfellow I, Metaxas D, Odena A (2019) Self-attention generative adversarial networks. In: International conference on machine learning

  363. Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H (2020) Training data-efficient image transformers & distillation through attention. arXiv

  364. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: European conference on computer vision

  365. Graham B, El-Nouby A, Touvron H, Stock P, Joulin A, Jégou H, Douze M (2021) Levit: a vision transformer in convnet’s clothing for faster inference. arXiv

  366. Zhou D, Kang B, Jin X, Yang L, Lian X, Jiang Z, Hou Q, Feng J (2021) Deepvit: towards deeper vision transformer. arXiv

  367. Wang W, Xie E, Li X, Fan D-P, Song K, Liang D, Lu T, Luo P, Shao L (2021) Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. arXiv

  368. Guan T, Wang J, Lan S, Chandra R, Wu Z, Davis L, Manocha D (2021) M3detr: multi-representation, multi-scale, mutual-relation 3d object detection with transformers. arXiv

  369. Valanarasu JMJ, Oza P, Hacihaliloglu I, Patel VM (2021) Medical transformer: gated axial-attention for medical image segmentation. arXiv

  370. Lee RCT, Chin YH, Chang SC (1976) Application of principal component analysis to multikey searching. IEEE Trans Softw Eng 3:185–193

    Article  Google Scholar 

  371. Ye J, Janardan R, Li Q (2004) Two-dimensional linear discriminant analysis. In: Advances in neural information processing systems vol 17 [Neural information processing systems, NIPS 2004, December 13–18, 2004, Vancouver, British Columbia, Canada], pp 1569–1576

  372. Robinson S, Bennett R (1995) A typology of deviant workplace behaviors: a multidimensional scaling study. Acad Manag J 38:555–572

    Article  Google Scholar 

  373. Samko O, Marshall AD, Rosin PL (2006) Selection of the optimal parameter value for the isomap algorithm. Pattern Recogn. Lett. 9:968–979

    Article  Google Scholar 

  374. Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500):2323–2326

    Article  Google Scholar 

  375. Belkin M, Niyogi P (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput 6:1373–1396

    Article  Google Scholar 

  376. Singh AP, Gordon GJ. Relational learning via collective matrix factorization. In: ACM SIGKDD, pp 650–658

  377. Cao S, Lu W, Xu Q (2015) Grarep: learning graph representations with global structural information. In: CIKM, pp 891–900

  378. Ou M, Cui P, Pei J, Zhang Z, Zhu W (2016) Asymmetric transitivity preserving graph embedding. In: ACM SIGKDD, pp 1105–1114

  379. Sugiyama M, Borgwardt KM (2015) Halting in random walk kernels. In: NIPS

  380. Kang U, Tong H, Sun J (2012) Fast random walk graph kernel. In: SIAM, pp 828–838

  381. Shervashidze N, Schweitzer P, Leeuwen EJ, Mehlhorn K, Borgwardt KM (2011) Weisfeiler-lehman graph kernels. J Mach Learn Res 12:2539–2561

    MathSciNet  Google Scholar 

  382. Erhan D, Manzagol P-A, Bengio Y, Bengio S, Vincent P (2009) The difficulty of training deep architectures and the effect of unsupervised pre-training. In: Artificial intelligence and statistics

  383. Erhan D, Courville A, Bengio Y, Vincent P (2010) Why does unsupervised pre-training help deep learning? In: AISTATS

  384. Lee JD, Lei Q, Saunshi N, Zhuo J (2020) Predicting what you already know helps: provable self-supervised learning. arXiv

  385. Tosh C, Krishnamurthy A, Hsu D (2021) Contrastive learning, multi-view redundancy, and linear models. In: Algorithmic learning theory

  386. Arora S, Khandeparkar H, Khodak M, Plevrakis O, Saunshi N (2019) A theoretical analysis of contrastive unsupervised representation learning. arXiv

  387. Anwar S, Tahir M, Li C, Mian A, Khan FS, Muzaffar AW (2020) Image colorization: a survey and dataset. arXiv

  388. Ledig C, Theis L, Huszár F, Caballero J, Cunningham A, Acosta A, Aitken A, Tejani A, Totz J, Wang Z et al (2017) Photo-realistic single image super-resolution using a generative adversarial network. In: CVPR

  389. Perarnau G, Van De Weijer J, Raducanu B, Álvarez JM (2016) Invertible conditional gans for image editing. arXiv

  390. Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. arXiv

  391. Tulyakov S, Liu M-Y, Yang X, Kautz J (2018) Mocogan: decomposing motion and content for video generation. In: CVPR

  392. Wang X, Gupta A (2015) Unsupervised learning of visual representations using videos. In: ICCV

  393. Wei C, Xie L, Ren X, Xia Y, Su C, Liu J, Tian Q, Yuille AL (2019) Iterative reorganization with weak spatial constraints: solving arbitrary jigsaw puzzles for unsupervised representation learning. In: CVPR

  394. Ahsan U, Madhok R, Essa I (2019) Video jigsaw: Unsupervised learning of spatiotemporal context for video action recognition. In: WACV

  395. Pathak D, Girshick R, Dollár P, Darrell T, Hariharan B (2017) Learning features by watching objects move. In: CVPR

  396. Croitoru I, Bogolin S-V, Leordeanu M (2017) Unsupervised learning from video to detect foreground objects in single images. In: ICCV

  397. Sermanet P, Lynch C, Chebotar Y, Hsu J, Jang E, Schaal S, Levine S, Brain G (2018) Time-contrastive networks: self-supervised learning from video. In: ICRA

  398. Korbar B, Tran D, Torresani L (2018) Cooperative learning of audio and video models from self-supervised synchronization. arXiv

  399. Manning CD, Raghavan P, Schütze H (2008). Introduction to information retrieval. https://doi.org/10.1017/CBO9780511809071

    Article  Google Scholar 

  400. Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn. https://doi.org/10.1023/A:1007614523901

    Article  Google Scholar 

  401. Reiter E (2018) A structured review of the validity of bleu. Comput Linguist

  402. Denkowski M, Lavie A (2014) Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the ninth workshop on statistical machine translation

  403. Lin C-Y, Hovy E (2003) Automatic evaluation of summaries using n-gram co-occurrence statistics. In: HLT-NAACL

  404. Kim Y (2014) Convolutional neural networks for sentence classification. In: EMNLP. https://doi.org/10.3115/v1/d14-1181

  405. Kalchbrenner N, Grefenstette E, Blunsom P (2014) A convolutional neural network for modelling sentences. In: ACL

  406. Yang M, Zhao W, Ye J, Lei Z, Zhao Z, Zhang S (2018) Investigating capsule networks with dynamic routing for text classification. In: EMNLP

  407. Yao L, Mao C, Luo Y (2019) Graph convolutional networks for text classification. In: AAAI. https://doi.org/10.1609/aaai.v33i01.33017370

  408. Wang Y, Sun A, Han J, Liu Y, Zhu X (2018) Sentiment analysis by capsules In: WWW. https://doi.org/10.1145/3178876.3186015

  409. Socher R, Perelygin A, Wu J, Chuang J, Manning CD, Ng AY, Potts C (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In: EMNLP

  410. Tai KS, Socher R, Manning CD (2015) Improved semantic representations from tree-structured long short-term memory networks. In: ACL. https://doi.org/10.3115/v1/p15-1150

  411. Zhu X, Sobhani P, Guo H (2015) Long short-term memory over recursive structures. In: ICML

  412. Cheng J, Dong L, Lapata M (2016) Long short-term memory-networks for machine reading. In: EMNLP. https://doi.org/10.18653/v1/d16-1053

  413. Liu P, Qiu X, Chen X, Wu S, Huang X (2015) Multi-timescale long short-term memory neural network for modelling sentences and documents. In: EMNLP. https://doi.org/10.18653/v1/d15-1280

  414. Liu P, Qiu X, Huang X (2016) Recurrent neural network for text classification with multi-task learning. In: IJCAI

  415. Socher R, Pennington J, Huang EH, Ng AY, Manning CD (2011) Semi-supervised recursive autoencoders for predicting sentiment distributions. In: EMNLP

  416. Shen T, Zhou T, Long G, Jiang J, Zhang C (2018) Bi-directional block self-attention for fast and memory-efficient sequence modeling. In: ICLR

  417. Le QV, Mikolov T (2014) Distributed representations of sentences and documents. In: ICML

  418. Iyyer M, Manjunatha V, Boyd-Graber JL, III HD (2015) Deep unordered composition rivals syntactic methods for text classification. In: ACL. https://doi.org/10.3115/v1/p15-1162

  419. Miyato T, Dai AM, Goodfellow IJ (2017) Adversarial training methods for semi-supervised text classification. In: ICLR

  420. Lai S, Xu L, Liu K, Zhao J (2015) Recurrent convolutional neural networks for text classification. In: AAAI

  421. Johnson R, Zhang T (2016) Supervised and semi-supervised text categorization using LSTM for region embeddings. In: ICML

  422. Bao Y, Wu M, Chang S, Barzilay R (2020) Few-shot text classification with distributional signatures. In: ICLR

  423. Wu F, Jr AHS, Zhang T, Fifty C, Yu T, Weinberger KQ (2019) Simplifying graph convolutional networks. In: ICML

  424. Zhang X, Zhao JJ, LeCun Y (2015) Character-level convolutional networks for text classification. In: NIPS

  425. Johnson R, Zhang T (2017) Deep pyramid convolutional neural networks for text categorization. In: ACL. https://doi.org/10.18653/v1/P17-1052

  426. Wang J, Wang Z, Zhang D, Yan J (2017) Combining knowledge with deep convolutional neural networks for short text classification. In: IJCAI. https://doi.org/10.24963/ijcai.2017/406

  427. Huang L, Ma D, Li S, Zhang X, Wang H (2019) Text level graph neural network for text classification. In: EMNLP-IJCNLP. https://doi.org/10.18653/v1/D19-1345

  428. Sun C, Qiu X, Xu Y, Huang X (2019) How to fine-tune BERT for text classification? In: CCL. https://doi.org/10.1007/978-3-030-32381-3_16

  429. Yang Z, Yang D, Dyer C, He X, Smola AJ, Hovy EH (2016) Hierarchical attention networks for document classification. In: NAACL-HLT

  430. Bowman SR. Angeli G, Potts C, Manning CD (2015) A large annotated corpus for learning natural language inference. In: EMNLP.https://doi.org/10.18653/v1/d15-1075

  431. Wang Z, Hamza W, Florian R (2017) Bilateral multi-perspective matching for natural language sentences. In: IJCAI. https://doi.org/10.24963/ijcai.2017/579

  432. Liu X, He P, Chen W, Gao J (2019) Multi-task deep neural networks for natural language understanding. In: ACL. https://doi.org/10.18653/v1/p19-1441

  433. Williams A, Nangia N, Bowman SR (2018) A broad-coverage challenge corpus for sentence understanding through inference. In: NAACL-HLT. https://doi.org/10.18653/v1/n18-1101

  434. Marelli M, Bentivogli L, Baroni M, Bernardi R, Menini S, Zamparelli R (2014) Semeval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In: SemEval@COLING. https://doi.org/10.3115/v1/s14-2001

  435. Dolan B, Quirk C, Brockett C (2004) Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In: COLING

  436. Fu J, Liu P, Neubig G (2020) Interpretable multi-dataset evaluation for named entity recognition. In: EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.489

  437. Lester B, Pressel D, Hemmeter A, Choudhury SR, Bangalore S (2020) Constrained decoding for computationally efficient named entity recognition taggers. In: EMNLP. https://doi.org/10.18653/v1/2020.findings-emnlp.166

  438. Luo Y, Zhao H, Zhan J (2020) Named entity recognition only from word embeddings. In: EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.723

  439. Li X, Feng J, Meng Y, Han Q, Wu F, Li J (2020) A unified MRC framework for named entity recognition. In: ACL. https://doi.org/10.18653/v1/2020.acl-main.519

  440. Zhang Y, Yang J (2018) Chinese NER using lattice LSTM. In: ACL. https://doi.org/10.18653/v1/P18-1144

  441. Meng Y, Wu W, Wang F, Li X, Nie P, Yin F, Li M, Han Q, Sun X, Li J (2019) Glyce: Glyph-vectors for Chinese character representations. In: NeurIPS

  442. Katiyar A, Cardie C (2018) Nested named entity recognition revisited. In: NAACL-HLT. https://doi.org/10.18653/v1/n18-1079

  443. Wang B, Lu W (2018) Neural segmental hypergraphs for overlapping mention recognition. In: EMNLP. https://doi.org/10.18653/v1/d18-1019

  444. Luan Y, Wadden D, He L, Shah A, Ostendorf M, Hajishirzi H (2019) A general framework for information extraction using dynamic span graphs. In: NAACL-HLT. https://doi.org/10.18653/v1/n19-1308

  445. Shibuya T, Hovy EH (2020) Nested named entity recognition via second-best sequence learning and decoding. Trans Assoc Comput Linguist 8:605–620

  446. Lin H, Lu Y, Han X, Sun L (2019) Sequence-to-nuggets: Nested entity mention detection via anchor-region networks. In: ACL. https://doi.org/10.18653/v1/p19-1511

  447. Lai G, Xie Q, Liu H, Yang Y, Hovy EH (2017) RACE: large-scale reading comprehension dataset from examinations. In: EMNLP. https://doi.org/10.18653/v1/d17-1082

  448. Yang Y, Yih W, Meek C (2015) Wikiqa: a challenge dataset for open-domain question answering. In: EMNLP. https://doi.org/10.18653/v1/d15-1237

  449. Santos CN, Tan M, Xiang B, Zhou B (2016) Attentive pooling networks. CoRR arXiv:1602.03609

  450. Lee JY, Dernoncourt F (2016) Sequential short-text classification with recurrent and convolutional neural networks. In: NAACL-HLT. https://doi.org/10.18653/v1/n16-1062

  451. Kim S, D’Haro LF, Banchs RE, Williams JD, Henderson M (2016) The fourth dialog state tracking challenge. In: Dialogues with social robots—enablements, analyses, and evaluation, seventh international workshop on spoken dialogue systems, IWSDS 2016, Saariselkä, Finland, January 13–16, 2016. https://doi.org/10.1007/978-981-10-2585-3_36

  452. Ang J, Liu Y, Shriberg E (2005) Automatic dialog act segmentation and classification in multiparty meetings. In: 2005 IEEE international conference on acoustics, speech, and signal processing, ICASSP ’05, Philadelphia, Pennsylvania, USA, March 18–23, 2005.https://doi.org/10.1109/ICASSP.2005.1415300

  453. Wan Y, Yan W, Gao J, Zhao Z, Wu J, Yu PS (2018) Improved dynamic memory network for dialogue act classification with adversarial training. In: IEEE international conference on Big Data, Big Data 2018, Seattle, WA, USA, December 10–13, 2018. https://doi.org/10.1109/BigData.2018.8622245

  454. Raheja V, Tetreault JR (2019) Dialogue act classification with context-aware self-attention. In: Proc. NAACL, 2019. https://doi.org/10.18653/v1/n19-1373

  455. Xu J, Gan Z, Cheng Y, Liu J (2020) Discourse-aware neural extractive text summarization. In: ACL. https://doi.org/10.18653/v1/2020.acl-main.451

  456. Zou Y, Zhang X, Lu W, Wei F, Zhou M (2020) Pre-training for abstractive document summarization by reinstating source text. In: EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.297

  457. Liu L, Lu Y, Yang M, Qu Q, Zhu J, Li H (2018) Generative adversarial network for abstractive text summarization. In: AAAI

  458. Yang M, Qu Q, Tu W, Shen Y, Zhao Z, Chen X (2019) Exploring human-like reading strategy for abstractive text summarization. In: AAAI. https://doi.org/10.1609/aaai.v33i01.33017362

  459. Bhandari M, Gour PN, Ashfaq A, Liu P, Neubig G (2020) Re-evaluating evaluation in text summarization. In: EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.751

  460. Dong Y, Wang S, Gan Z, Cheng Y, Cheung JCK, Liu J (2020) Multi-fact correction in abstractive text summarization. In: EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.749

  461. Huang D, Cui L, Yang S, Bao G, Wang K, Xie J, Zhang Y (2020) What have we achieved on text summarization? In: EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.33

  462. Kryscinski W, Paulus R, Xiong C, Socher R (2018) Improving abstraction in text summarization. In: EMNLP. https://doi.org/10.18653/v1/d18-1207

  463. Kryscinski W, McCann B, Xiong C, Socher R (2020) Evaluating the factual consistency of abstractive text summarization. In: EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.750

  464. Kouris P, Alexandridis G, Stafylopatis A (2019) Abstractive text summarization based on deep learning and semantic content generalization. In: ACL. https://doi.org/10.18653/v1/p19-1501

  465. Chen K, Wang R, Utiyama M, Sumita E (2020) Content word aware neural machine translation. In: ACL. https://doi.org/10.18653/v1/2020.acl-main.34

  466. Lin Z, Pan X, Wang M, Qiu X, Feng J, Zhou H, Li L (2020) Pre-training multilingual neural machine translation by leveraging alignment information. In: EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.210

  467. Bugliarello E, Okazaki N (2020) Enhancing machine translation with dependency-aware self-attention. In: ACL. https://doi.org/10.18653/v1/2020.acl-main.147

  468. Aji AF, Bogoychev N, Heafield K, Sennrich R (2020) In neural machine translation, what does transfer learning transfer? In: ACL. https://doi.org/10.18653/v1/2020.acl-main.688

  469. Baziotis C, Haddow B, Birch A (2020) Language model prior for low-resource neural machine translation. In: EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.615

  470. Cui Q, Huang S, Li J, Geng X, Zheng Z, Huang G, Chen J (2021) Directqe: Direct pretraining for machine translation quality estimation. In: AAAI

  471. Wu C, Hoi SCH, Socher R, Xiong C (2020) TOD-BERT: pre-trained natural language understanding for task-oriented dialogue. In: EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.66

  472. Campagna G, Foryciarz A, Moradshahi M, Lam MS (2020) Zero-shot transfer learning with synthesized data for multi-domain dialogue state tracking. In: ACL. https://doi.org/10.18653/v1/2020.acl-main.12

  473. Liu Q, Yu L, Rimell L, Blunsom P (2021) Pretraining the noisy channel model for task-oriented dialogue. CoRR arXiv:2103.10518

  474. SST Corpus. http://nlp.stanford.edu/sentiment (2013)

  475. Pang B, Lee L (2005) Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In: ACL

  476. Cer D, Diab M, Agirre E, Lopez-Gazpio I, Specia L (2017) Semeval-2017 task 1: semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv

  477. Hendrickx I, Kim SN, Kozareva Z, Nakov P, Séaghdha DÓ, Padó S, Pennacchiotti M, Romano L, Szpakowicz S (2009) Semeval-2010 task 8: multi-way classification of semantic relations between pairs of nominals. In: Proc. NAACL, 2009

  478. Wiebe J, Wilson T, Cardie C (2005) Annotating expressions of opinions and emotions in language. Lang Resour Eval. https://doi.org/10.1007/s10579-005-7880-9

    Article  Google Scholar 

  479. MPQA Corpus. http://www.cs.pitt.edu/mpqa/ (2005)

  480. Diao Q, Qiu M, Wu C, Smola AJ, Jiang J, Wang C (2014) Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). In: ACM SIGKDD. https://doi.org/10.1145/2623330.2623758

  481. 20NG Corpus. http://ana.cachopo.org/datasets-for-single-label-text-categorization (2007)

  482. AG Corpus. http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html (2004)

  483. Reuters Corpus. https://www.cs.umb.edu/~smimarog/textmining/datasets/ (2007)

  484. Reuters Corpus. https://martin-thoma.com/nlp-reuters (2017)

  485. Lehmann J, Isele R, Jakob M, Jentzsch A, Kontokostas D, Mendes PN, Hellmann S, Morsey M, Kleef P, Auer S, Bizer C (2015) Dbpedia–a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web. https://doi.org/10.3233/SW-140134

    Article  Google Scholar 

  486. Ohsumed Corpus (2015) http://davis.wpi.edu/xmdv/datasets/ohsumed.html

  487. Williams A, Nangia N, Bowman SR (2017) A broad-coverage challenge corpus for sentence understanding through inference. arXiv

  488. Rajpurkar P, Zhang J, Lopyrev K, Liang P (2016) Squad: 100,000+ questions for machine comprehension of text. arXiv

  489. Levesque H, Davis E, Morgenstern L (2012) The winograd schema challenge. In: Thirteenth international conference on the principles of knowledge representation and reasoning

  490. Dolan WB, Brockett C (2005) Automatically constructing a corpus of sentential paraphrases. In: IWP

  491. Rajpurkar P, Jia R, Liang P (2018) Know what you don’t know: unanswerable questions for squad. arXiv

  492. Lai G, Xie Q, Liu H, Yang Y, Hovy E (2017) Race: large-scale reading comprehension dataset from examinations. arXiv

  493. Jurafsky D, Shriberg E (1997) Switchboard swbd-damsl shallow-discourse-function annotation coders manual

  494. Li J, Zhou P, Xiong C, Socher R, Hoi SC (2020) Prototypical contrastive learning of unsupervised representations. arXiv preprint arXiv:2005.04966

  495. Donahue J, Simonyan K (2019) Large scale adversarial representation learning. Adv Neural Inf Process Syst 32

  496. He K, Chen X, Xie S, Li Y, Dollár P, Girshick R (2021) Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377

  497. http://yann.lecun.com/exdb/mnist/

  498. http://ufldl.stanford.edu/housenumbers/

  499. https://www.cs.toronto.edu/~kriz/index.html

  500. Coates A, Ng A, Lee H (2011) An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics

  501. https://cs.stanford.edu/~acoates/stl10/

  502. http://www.vision.caltech.edu/Image_Datasets/Caltech101/

  503. Miller GA (1998) WordNet: an electronic lexical database

  504. https://image-net.org/

  505. https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/

  506. Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: a large video database for human motion recognition. In: ICCV

  507. https://www.crcv.ucf.edu/data/UCF101.php

  508. https://www.crcv.ucf.edu/data/UCF50.php

  509. Bossard L, Guillaumin M, Van Gool L (2014) Food-101—mining discriminative components with random forests. In: European conference on computer vision

  510. Berg T, Liu J, Woo Lee S, Alexander ML, Jacobs DW, Belhumeur PN (2014) Birdsnap: large-scale fine-grained visual categorization of birds. In: CVPR

  511. Xiao J, Hays J, Ehinger KA, Oliva A, Torralba A (2010) Sun database: large-scale scene recognition from abbey to zoo. In: 2010 IEEE computer society conference on computer vision and pattern recognition

  512. Xiao J, Ehinger KA, Hays J, Torralba A, Oliva A (2016) Sun database: exploring a large collection of scene categories. Int J Comput Vis 119:3–22

    Article  MathSciNet  Google Scholar 

  513. http://places.csail.mit.edu/downloadData.html

  514. http://ai.stanford.edu/~jkrause/cars/car_dataset.html

  515. Maji S, Kannala J, Rahtu E, Blaschko,M. Vedaldi A (2013) Fine-grained visual classification of aircraft. Technical report

  516. https://sites.google.com/site/fgcomp2013/

  517. https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/

  518. https://www.robots.ox.ac.uk/~vgg/data/pets/

  519. https://www.robots.ox.ac.uk/~vgg/data/flowers/

  520. https://www.robots.ox.ac.uk/~vgg/data/dtd/

  521. https://sites.google.com/view/fgvc5/competitions/inaturalist

  522. https://www.inaturalist.org/

  523. Sun C, Shrivastava A, Singh S, Gupta A (2017) Revisiting unreasonable effectiveness of data in deep learning era. In: Proceedings of the IEEE international conference on computer vision, pp 843–852

  524. http://host.robots.ox.ac.uk/pascal/VOC/

  525. http://host.robots.ox.ac.uk/pascal/VOC/voc2007/index.html

  526. http://host.robots.ox.ac.uk/pascal/VOC/voc2011/index.html

  527. http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html

  528. Zhou B, Zhao H, Puig X, Fidler S, Barriuso A, Torralba A (2017) Scene parsing through ade20k dataset. In: CVPR

  529. Zhou B, Zhao H, Puig X, Xiao T, Fidler S, Barriuso A, Torralba A (2019) Semantic understanding of scenes through the ade20k dataset. Int J Comput Vis 127:301–321

    Article  Google Scholar 

  530. https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html

  531. Cordts M, Omran M, Ramos S, Scharwächter T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2015) The cityscapes dataset. In: CVPR workshop on the future of datasets in vision

  532. Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The cityscapes dataset for semantic urban scene understanding. In: Proc. of the IEEE conference on computer vision and pattern recognition (CVPR)

  533. Gupta A, Dollar P, Girshick R (2019) LVIS: A dataset for large vocabulary instance segmentation. In: CVPR

  534. https://davischallenge.org/

  535. https://davischallenge.org/davis2017/code.html

  536. Doersch C. Data analysis project: what makes Paris look like Paris?

  537. http://www.cs.toronto.edu/~nitish/unsupervised_video/

  538. Srivastava N, Mansimov E, Salakhudinov R (2015) Unsupervised learning of video representations using lstms. In: International conference on machine learning

  539. Thomee B, Shamma DA, Friedland G, Elizalde B, Ni K, Poland D, Borth D, Li L-J (2016) Yfcc100m: the new data in multimedia research. Commun ACM

  540. http://projects.dfki.uni-kl.de/yfcc100m/

  541. Jin W, Liu X, Zhao X, Ma Y, Shah N, Tang J (2021) Automated self-supervised learning for graphs. CoRR arXiv:2106.05470

  542. Peng Z, Dong Y, Luo M, Wu X, Zheng Q (2020) Self-supervised graph representation learning via global context prediction. CoRR arXiv:2003.01604

  543. Zhu Y, Xu Y, Yu F, Liu Q, Wu S, Wang L (2020) Deep graph contrastive representation learning. CoRR arXiv:2006.04131

  544. Jin M, Zheng Y, Li Y, Gong C, Zhou C, Pan S (2021) Multi-scale contrastive siamese networks for self-supervised graph representation learning. CoRR arXiv:2105.05682

  545. Hu Z, Fan C, Chen T, Chang K, Sun Y (2019) Pre-training graph neural networks for generic structural feature extraction. CoRR arXiv:1905.13728

  546. Zhu Y, Xu Y, Yu F, Wu S, Wang L (2020) Cagnn: cluster-aware graph neural networks for unsupervised graph representation learning. arXiv preprint arXiv:2009.01674

  547. Zhang H, Lin S, Liu W, Zhou P, Tang J, Liang X, Xing EP (2020) Iterative graph self-distillation. CoRR arXiv:2010.12609

  548. Lin S, Zhou P, Hu Z-Y, Wang S, Zhao R, Zheng Y, Lin L, Xing E, Liang X (2021) Prototypical graph contrastive learning. IEEE trans neural networks learn syst 35(2):2747–2758

  549. Subramonian A (2021) Motif-driven contrastive learning of graph representations. Proc AAAI Conf Artif Intell 35:15980–15981

    Google Scholar 

  550. Opolka FL, Solomon A, Cangea C, Velickovic P, Liò P, Hjelm RD (2019) Spatio-temporal deep graph infomax. CoRR

Download references

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Ce Zhou or Qian Li.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Basic components

1.1 A.1 Basic components on NLP

Table 4 Commonly used notations on NLP and graph

1.1.1 A.1.1 Language model

With the rapid development of deep learning, LMs are more and more applicable to the pretraining of NLP models. The LM can estimate the probability of rationality of a paragraph of the text. There are two main types of LMs: statistical LM and neural network LM.

Statistical LM The statistical LM is a mathematical model to solve the context-related characteristics of natural language from the perspective of probability and statistics. The core of statistical LMs is to determine the probability of a sentence appearing in a text. As the theoretical basis of the probabilistic LM, the N-gram model profoundly influences the subsequent LM. It plays a pivotal role in the field of the LM. The N-gram LM introduces the Markov hypothesis, which assumes that the probability of the occurrence of the current word only depends on the nearest \(n-1\) words. The maximum likelihood probability of a word \(w_{i}\) can be calculated by

$$\begin{aligned} {\begin{matrix} p\left( w_{i} \mid w_{1}, w_{2}, \ldots , w_{N} \right) & = p\left( w_{i} \mid w_{i-n+1}, w_{i-n+2}, \ldots , w_{i-1} \right) \\ & = \frac{C\left( w_{i-n+1}, w_{i-n+2}, \ldots , w_{i}\right) }{\sum _{N} C\left( w_{i-n+1}, w_{i-n+2}, \ldots , w_{i-1} \right) }, \end{matrix}} \end{aligned}$$
(16)

where \(T=[w_{1}, w_{2}, \ldots , w_{N}]\) is the text sequence and \(C(w_{i-n+1}, w_{i-n+2}, \ldots , w_{i})\) is the co-occurrence frequency of \((w_{i-n+1}, w_{i-n+2}, \ldots , w_{i})\). The \(p\left( w_{i} \mid w_{1}, w_{2}, \ldots , w_{N} \right)\) is calculated according to the chain rule

$$\begin{aligned} p\left( w_{1}, w_{2}, \ldots , w_{N}\right) =\prod _{i=1}^{N} p\left( w_{i} \mid w_{1}, w_{2}, \ldots , w_{i-1}\right) . \end{aligned}$$
(17)

N-gram uses the probabilities of each word in the sequence to represent the co-occurrence probability of the whole text sequence. When N is large, it indicates a more vital constraint on the occurrence of the next word in the sequence and leads to more sparse frequency information. When N is small, the statistical results have higher reliability and better generalization ability, but the constraint will be weaker.

Neural LM The statistical LM adopts maximum likelihood estimation, which is intuitive and easy to understand. However, there are still problems such as a lack of long-term dependence, the rapid growth of parameter space and sparse data. Therefore, the neural network is introduced to map the LM to a continuous space. Neural LMs use distributed representations of words to model natural language sequences. Unlike class-based N-gram models, neurolinguistic models are able to recognize two similar words without losing the ability to encode each word as different from the other. It can be directly used for NLP tasks. It mainly introduces Forward Feedback Neural Networks (FFNN), Recurrent Neural Networks (RNN), and pretrained LMs.

Fig. 15
figure 15

The model architectures of forward feedback neural network, recurrent neural network and Pretrained LMs. \(H^{1,2}\), \(H^{2,3}\) and \(H^{1,3}\) are the weight matrices used to connect each layer

As shown in Fig. 15a, FFNN according to the all former words of \(x=[w_{1}, \ldots , w_{i-1}]\) calculates the probability of \(w_{i}\). In order to predict the conditional probability of \(w_{i}\), x is sharing the projection matrix \(M \in R^{|V| \times m}\) to a continuous feature vector space according to the projection index, |V| is word library size, m is the dimension of the feature vector. The output is represented as

$$\begin{aligned} y=b_{2}+H^{1,3}_{x}+H^{2,3}_{x} \tanh (b_{1}+H^{1,2}_{x}), \end{aligned}$$
(18)

where \(H^{1,2}\), \(H^{2,3}\) and \(H^{1,3}\) are the weight matrices used to connect each layer, and \(b_{1}\) and \(b_{2}\) are the bias values of the hidden layer and the output layer respectively.

The structure of the FFNN contains only limited information about the foregoing and has some limitations on the length of the input sequence. Therefore, the RNN LM comes into being. As shown in Fig. 15b, RNN can accept input of any variable length. When the input window is moved, its internal state mechanism can avoid repeated calculation, and parameter sharing further reduces the number of model parameters. Therefore, compared with FFNN, RNN has a great advantage.

The pretraining LM is to get a set of model parameters by pretraining some tasks. It initializes the model with these parameters and then trains to improve the model performance effectively. The commonly used pretraining models are fixed embedding (Word2vec [12], Glove [69], etc), variable embedding (Embeddings from LMs (ELMO) [303], Generative pretrained Transformer (GPT) [50] and Bidirectional Encoder Representations from Transformers (BERT) [13], etc). Here, we give an example of the GPT model, as shown in Fig. 15c. It adopts a two-stage process. In the first stage, the Transformer decoder is used as the basic unit of the model to perform text prediction. In the second stage, the GPT is initialized differently for different downstream tasks, training the model and fine-tuning the parameters.

1.2 A.2 Basic components on GL

Due to the extensive use of graph data in many fields, some communities (e.g., chemistry, protein, and social network) have recently focused on the study of graph pretraining. These pretraining models encode graph attributes, structures, and other information into node representations from multiple perspectives by designing different pretext tasks, which are used to optimize downstream tasks. In this section, we introduce the definition of the basic concepts of graphs, and then provide a formal definition of the PFM on the graph.

1.2.1 A.2.1 Notations and definitions of graphs

Unless particularly specified, the notations used in this article are illustrated in Table 4. We use \(\mathcal {G}=\{G_{i}\}^{N}_{i}\) to represent a set of graphs, where N represents the number of graphs. Depending to the graph’s definition of the edges and nodes, graph data can be classified into the following types.

Definition 1

An unattributed graph is \(G=(V,E)\), where \(v \in V\) is a node, \(e \in E\) is an edge, and naturally \(E \subseteq V \times V\). Adjacency matrix \(A \in \mathbb {R}^{n \times n}\) represents the topology of graph G, where \(n=|V|\). \(A_{i,j}=1\) denotes there is an edge between node \(v_{i}\) and \(v_{j}\), otherwise \(A_{i,j}=0\).

Definition 2

An attributed graph is \(G=(V,E,X_{v},X_{e})\), where \(X_{v} \in \mathbb {R}^{n \times d_{v}}\) and \(X_{e} \in \mathbb {R}^{m \times d_{e}}\) are the feature matrices of nodes and edges, \(|V|=n\), \(|E|=m\), \(d_{v}\) and \(d_{e}\) denotes the feature dimensions of node and edge. In fact, in most application scenarios, only nodes have attributes, and edges have no attributes or only weights.

Definition 3

An undirected graph is \(G=(V,E)\), where \(e_{i,j} \in E\) means an unordered node pair \((v_{i}, v_{j})\). In particular, the adjacency matrix A of the undirected graph is a symmetric matrix (i.e., \(A_{i,j}=A_{j,i}\)).

Definition 4

A directed graph is \(G=(V,E)\), where \(e_{i,j} \in E\) means an ordered node pair \((v_{i}, v_{j})\).

Definition 5

G has a node-type mapping function \(f_{v}: V \rightarrow \mathcal {T}^{v}\) and an edge-type mapping function \(f_{e}: E \rightarrow \mathcal {T}^{e}\). When \(|\mathcal {T}^{v}|=|\mathcal {T}^{e}|=1\), the graph \(G=(V,E)\) is a homogeneous graph. In other words, all nodes in G belong to a type, and all edges also belong to one type.

Definition 6

When \(|\mathcal {T}^{v}|>1\) and/or \(|\mathcal {T}^{e}|>1\), the graph \(G=(V,E)\) is a heterogeneous graph. In particular, a heterogeneous graph must be an attributed graph.

1.2.2 A.2.2 Learning settings on graphs

GL methods are usually used to solve machine learning tasks on graph data, and we introduce different settings (supervision mode and learning mode) for GL.

Before that, we first provide the notations of the corresponding mathematical formulation of GL. \(C=\{c_{1}, c_{2}, \cdots , c_{K}\}\) is a set of target components defined in a graph set \(\mathcal {G}\) (\(G^{c_{i}} \in \mathcal {G}\)), and \(c_{i}\) is associated with a corresponding ground truth label \(y_{i} \in \mathcal {Y}=\{1,2,\cdots ,N_{y}\}\), where K denotes the total number of target components, and \(N_{y}\) is the number of classes being predicted. Then the graph data can be represented as \(D=\{c_{i}, G^{c_{i}}, y_{i}\}^{K}_{i}\), and a complete GL model \(M_{GL}\) can also be determined by \(y_{i} = M_{GL}(c_{i}, G^{c_{i}})\). For instance, in a node classification task, \(c_{i}\) is the node to be classified, \(y_{i}\) denotes \(c_{i}\)’s label in graph \(G^{c_{i}}\). Similarly, in a node clustering task, \(c_{i}\) is the node to be clustered, \(y_{i}\) denotes the corresponding cluster label in graph \(G^{c_{i}}\).

Supervision mode

Fig. 16
figure 16

Schematic of different supervision modes

Depending on the source and scale of the training data, the supervision settings of GL can be divided into four types as shown in Fig. 16. Supervised GL is the most common mode in the real scenario. Given the target component \(c_{i}\) and the corresponding ground truth label \(y_{i}\), the goal is to minimize the loss function between the predicted label of the GL model (i.e., \(y^{pred}_{i}=M_{GL}(c_{i}, G^{c_{i}})\)) and the expected label \(y_{i}\) of all \(c_{i}\). Compared with supervised learning, unsupervised GL refers to situations in which no labeled data is provided, only the attributes and structure distribution of graph data (i.e., \((c_{i}, G^{c_{i}})\)) can be used. Self-supervised GL is a special case of both supervised and unsupervised learning. Specifically, self-supervised learning mainly uses pretext tasks (e.g., clustering, completion, and partition) to mine its own supervised information (i.e., pseudo-labels) from large-scale unsupervised graph data, and trains the GL model \(M_{GL}\) through the self-supervised information, so that it can learn to the valuable features of downstream tasks. In other words, the supervised information of self-supervised learning is not manually labeled, but the pretext tasks automatically construct supervised information from large-scale unsupervised data for supervised learning or training. Semi-supervised learning is a combination of unsupervised and supervised learning, which aims at learning data distribution to predict unlabeled data to solve the problem of difficulty in obtaining labeled data in real scenarios. In GL, semi-supervised learning refers to the realization of pattern recognition given a few labeled data and mass unlabeled data.

Learning mode The GL model \(M_{GL}\) is optimized by the given training samples, and adjusted on the validation samples to participate in the test. According to the visibility of the graph data at different stages, the learning settings of GL model \(M_{GL}\) can be classified into two categories: inductive learning and transductive learning.

Definition 7

Inductive Learning, which is the most common setting in machine learning tasks, trains the model on labeled data and then tests on samples that have never appeared in the training stage. Formally, given a training sample \(\{(c_{i}, G^{c_{i}}, y_{i})\}^{N_{l}}_{i=1}\), \(\{(c_{j}, G^{c_{j}})\}^{N_{u}}_{j=1}\), where \(N_{l}\) and \(N_{u}\) are the numbers of labeled/unlabeled samples. Inductive learning learns a function \(f^{ind}: \mathcal {G} \mapsto \mathcal {Y}\) so that \(f^{ind}\) is expected to be a good classifier on the future graph data \(\{(c_{k}, G^{c_{k}})\}\), beyond \(\{(c_{j}, G^{c_{j}})\}^{N_{u}}_{j=1}\).

Definition 8

Transductive Learning is different from inductive learning in that all samples are visible during both the training and testing stages. Formally, given a training sample \(\{(c_{i}, G^{c_{i}}, y_{i})\}^{N_{l}}_{i=1}\), \(\{(c_{j}, G^{c_{j}})\}^{N_{u}}_{j=1}\), transductive learning learns a function \(f^{trans}: \mathcal {G}^{l+u} \mapsto \mathcal {Y}^{l+u}\) so that \(f^{trans}\) is expected to be a good classifier on the unlabeled data \(\{(c_{j}, G^{c_{j}})\}^{N_{u}}_{j=1}\).

Under the supervised setting (including semi-/self-supervised), the unified classifier optimization methods of inductive learning and transductive learning can be written as:

$$\begin{aligned} \mathcal {L}= \frac{1}{K} \sum ^{K}_{i=1} \mathcal {L}(f^{(\cdot )}_\theta (c_{i}, G^{c_i}), y_i), \end{aligned}$$
(19)

where \(\mathcal {L}\) is the cross-entropy loss, \(c_{i}\) can be node, edge or subgraph of its associated graph \(G^{c_{i}}\), and \(f^{(\cdot )}_\theta\) denotes inductive/transductive function with parameter \(\theta\).

Compared with using only one pretext task, some methods have designed some integration mechanisms to incorporate the advantages of multiple pretext tasks into a unified framework.

B Traditional learning methods

1.1 B.1 Traditional text learning

NLP is a research field that integrates linguistics and computer science. Its main research tasks include part-of-speech tagging, named entity recognition, semantic role labeling, machine translation, question answering, sentiment analysis, text summarization, text classification, relationship extraction, event extraction, etc. The LM can be considered the cornerstone of the downstream NLP tasks. It experiences four processes: grammar rule LM, probabilistic LM, neural network LM, and pretraining LM. A PFM trains on a large benchmark dataset to obtain a model that can solve new similar tasks, which has become a new hotspot in current LM research.

Word representations play a significant role in downstream tasks, which is the basis of NLP. The N-gram model preprocesses text features and encodes adjacent N words as a group, which makes it overly dependent on the richness of the training corpus. Otherwise, data-sparse is likely to occur, and the computational complexity will increase exponentially with the increase of N. Neural Network LM (NNLM) [11] adopts the idea of word vector for the first time, and the low-dimensional word vector of distributed representation can solve the discrete problem caused by word embedding well. However, it is still challenging to solve the problem of high computational complexity. The computational complexity of the word2vec model is independent of the selected window size but is determined by the dictionary size and the word vector dimension. Many downstream tasks can be significantly improved by training on a large corpus using word vector embedding after initial training. However, the problem of polysemy for the static word vector is still unsolved, and it still belongs to the shallow LM [304, 305]. Therefore, more effective models are urgently needed to deal with the data set more flexibly. To capture high-level concepts of context, such as polysemy elimination, syntactic structure, etc. Neelakantan et al. [306] propose to learn multiple embeddings per word type. Zhou et al. [307] integrate the features on both dimensions of the matrix to enrich semantics by using subword information. Based on the Continuous Bag Of Words (CBOW) [12] in word2vec, Hui et al. [308] fine-tune the generated word vectors for emotion and obtain the word vectors containing both semantic meaning and emotional tendency, which significantly improved the performance in the Weibo sentiment classification task. Liu et al. [309] propose a model of hierarchical translation for machine translation. It uses the neural LM based on RNN as the word vector generation model. Liang et al. [310] propose an approach based on the double-layer self-attention mechanism for machine reading comprehension, and the model is divided into three parts: single document encoder, multi-document encoder, and answer prediction. In the single document encoder, the problem of the context information is represented by the Gated Recurrent Unit (GRU) model. Zhang et al. [311] propose an INDependent RNN (INDRNN) and attention mechanism for user intention classification, using word vectors generated by word2vec as input. The model introduces a word-level attention mechanism to effectively quantify the contribution of domain vocabulary to the intention category.

1.2 B.2 Traditional image learning

There are several types of neural networks in the deep learning era, from the beginning of most famous convolutional neural networks (CNNs) to the subsequent Attention- and Transformer-based networks. A deep neural network refers to an artificial neural network with more hidden layers, and more parameters are used to represent the target model, which leads to the SOTA performance on the benchmark dataset from image to video. Here, we introduce the milestone networks in CV chronologically.

1.2.1 B.2.1 Convolution-based networks

ImageNet [312], as one of the most important databases in computer vision, has aroused many milestone network architectures in image classification, including AlexNet [313], NIN [314], VGG [315], GoogLeNet [316], ResNet [317], DenseNet [318], etc. When it comes to object detection and semantic segmentation, researchers explore R-CNNs [319,320,321,322], FCN [323], SSD [324], YOLOs [325,326,327,328,329], SegNet [330], PSPNet [331], Deeplabs [332,333,334,335], RefineNet [336], etc. on common benchmark datasets, such as PASCAL VOC [337, 338], MS COCO [339], etc.

There are several shared features among these popular convolution-based networks: (1) data augmentation. Deep models require much more data to fit a complicated model, thus the data augmentation technique such as flipping, rotation, cropping, scaling, translation, and even adding noises enlarges the training dataset; (2) convolution. The convolutional kernel is used to extract the features of original image data, which maintains the spatial structure for the adjacent pixels; (3) deep architecture. The deep architecture contains more parameters, which enhance the capability of the model. These common features have contributed to the SOTA performance of convolutional neural networks (CNNs) in computer vision for nearly recent 10 years.

1.2.2 B.2.2 Recurrent neural networks

Different from CNNs targeting 2D-dimensional image applications, recurrent neural networks (RNNs) [340,341,342] try to use recursive cells to process pictures in sequence, i.e., video data. However, the weaknesses of gradient explosion and long-term dependencies restrict further development of this model. To handle these problems embedded inside the RNN-based models, long short-term memory (LSTM) [343] was proposed by Hochreiter and Schmidhuber in 1997. In addition, the improved capability of LSTMs produces popularity and attracts attention both in NLP and CV [344,345,346,347,348].

1.2.3 B.2.3 Generation-based networks

Generative Adversarial Networks (GANs) [349] have provided a paradigm to learn representations for unlabelled data, and spawn many GAN-based approaches on downstream tasks. In image translation, pix2pix software [350] first proposes the conditional adversarial networks as a solution to the image-to-image translation problems, and achieves reasonable results on real-world datasets. Markovian Generative Adversarial Networks (MGANs) [351] is a method to generate texture synthesis, which can be applied to style transfer and video stylization. CycleGAN [352] provides a learning algorithm to translate an original image from the source domain to a target domain without containing pairs of images in datasets for supervised learning. StyleGAN [353] is a style-based generator to serve as an alternative architecture for traditional GANs. Pixel Recurrent Neural Networks (PixelRNN) [354] aims to complete images by modeling full dependencies between the color channels. DiscoGAN [355] is designed to learn relations between different domains.

GANs have also provided a novel direction to study data synthesis because it perfectly simulates the distribution of the original data. Laplacian Pyramid of Adversarial Networks (LAPGAN) [356] uses a cascade of convolutional networks to generate images in a coarse-to-fine fashion. Similarly, Stacked Generative Adversarial Networks (SGAN) [357] decompose variations into multiple levels and gradually resolve uncertainties by stacking several GANs in a top-down way.

1.2.4 B.2.4 Attention-based networks

Based on the success of CNNs in the area of CV, the attention module is designed to be equipped with the popular CNNs. For example, SENet [358] proposes a channel attention module, which won first place in the competition of ILSVRC2017. In addition, CBAM [359] sequentially infers attention maps along both channel and spatial dimensions. Many innovative works, such as GCNet [360] and CCNet [361], are inspired by this idea of soft-attention mechanism, which outperforms the traditional CNNs on major benchmarks for both recognition and segmentation tasks. In particular, the self-attention mechanism [362], calculating the response at a position among all entities in a sequence by attending to all positions within the same sequence, is proposed to estimate the relevance of one position to other positions in feature maps. To control the expected entities and model more complex relations among different elements in the sequence, masked self-attention and multi-head attention [38] are the key components proposed to substitute the function of convolutions in the era of transformers.

1.2.5 B.2.5 Transformer-based networks

Recently, inspired by the self-attention mechanism and subsequent success of the transformer in NLP, researchers in CV also have tried to use the transformer as an alternative to the convolution. Self-attention-based transformer models always operate in a two-stage training mechanism: (1) pretraining on a primitive dataset (always big but not well labeled) by defining pretext tasks; (2) transferring the pretrained weights to the downstream tasks and adjusting the parameters on the target domain dataset by finetuning. Vision Transformer (ViT) [40] is applied on CV and achieves the SOTA performance on major benchmark datasets. Data-efficient image Transformers (DeiT) [363]was proposed by Facebook AI to train image transformers more efficiently and maintain the SOTA performance simultaneously. DEtection TRansformer (DETR) [364] significantly outperforms competitive baselines in both object detection and semantic segmentation. LeViT [365] outperforms existing benchmarks with respect to balancing the accuracy and training speed. Image GPT [149] is inspired by a sequence transformer in NLP, which can compete with several self-supervised benchmarks on ImageNet. On the basis of this research, DeepViT [366] explores a deeper architecture to improve performance consistently by making the transformer go deeper. Moreover, many researchers try to apply the transformer to more specific tasks. Pyramid Vision Transformer (PVT) [367] introduces the pyramid structure to overcome the difficulties of porting the transformer to various dense prediction tasks, and achieves the SOTA performance on major benchmark datasets. M3DeTR [368] is a novel research on multi-representation, multi-scale, and mutual-relation 3D object detection with transformers. Medical Transformer (MedT) [369] has focused on medical image segmentation and outperforms previous CNN-based and transformer-based architecture. In conclusion, the transformer has become a novel and popular research area in CV and its performance is proved by many existing works.

1.3 B.3 Traditional graph learning

GL aims to embed the graph as a low-dimensional representation while preserving the desired properties of the original graph data. Classical GL methods are usually implemented using statistical methods or artificially designed components.

Dimension reduction As a commonly used method in feature engineering, dimension reduction aims to reduce the dimension of high-dimensional attribute graph data into a lower-dimensional representation. In GL, it highlights the remaining information at the cost of losing part of the attributes. According to different dimensionality reduction strategies, such methods can be classified into two types. The first type is subspace learning under the linear assumption. Based on the assumption that the principal components [370] related to the larger variance represent important structural information, and those smaller variances represent noise, principal component analysis calculates a low-dimensional representation that maximizes the variance of the data. Linear Discriminant Analysis (LDA) [371] achieves dimension reduction by maximizing the ratio of inter-class scattering and intra-class scattering to obtain a linear projection matrix. Multi-Dimensional Scaling (MDS) [372] is a distance-maintaining manifold learning method. It produces a mapping in a lower dimension to preserve dissimilarities between nodes as much as possible. The second type is nonlinear dimension reduction, which aims to automatically learn nonlinear topology to achieve manifold learning. Isomap [373] first constructs a neighborhood graph on the manifold and calculates the shortest path between pairs of nodes, and then uses MDS to construct a low-dimensional embedding. Locally Linear Embedding (LLE) [374] first allocates neighbors for each node. Then, it calculates the weighted \(W_{i,j}\), the best linear reconstruction feature \(X_{i}\) from its neighbors. Finally, calculate the low-dimensional embedding for the optimal reconstruction of \(W_{i,j}\).

Matrix factorization Greatly influenced by the idea of dimension reduction, the models based on matrix factorization emerged in the early research of GL. Such models aim to reconstruct the adjacency matrix of the graph to achieve dimension reduction while maintaining structural information. Although these models have significant limitations, in fact, their ideas still inspire many current studies. Depending on how the matrix is constructed, such methods often append specific constraints. Graph Laplacian eigenmaps [375] minimizes a loss function to ensure that nodes close to each other on the manifold are mapped into the low-dimensional space and still maintain the local distances. Node proximity matrix factorization [376] minimizes the objective function \(|W-YY^{cT}|\) through matrix factorization to approximate the proximity of nodes in the low-dimensional space, where Y and \(Y^{c}\) are the embeddings for nodes and context nodes, and W is the default node proximity matrix. GraRep [377] aims to preserve the high-order proximity of graphs in the embedding space, thus it derives a k-th order transition matrix, \(A^{k}\), by multiplying the adjacency matrix to itself k times. The transition probability from node \(v_{i}\) to node \(v_{j}\) is the entry in the i-th row and j-th column of the k-th order transition matrix, i.e., \(p_{k}(v_{i}|v_{j})=A^{k}_{i,j}\). Then GraRep defines the loss function using the skip-gram model and negative sampling. To capture the high-order proximity between node pairs, HOPE [378] preserves asymmetric transitivity in approximating the high-order proximity. Specifically, the goal of HOPE is to minimize the objective function \(||S-WC^{T}||^{2}_{F}\), where the elements \(s_{i,j} \in S\) represent a certain edge feature (e.g., Katz index, the Rooted Page-Rank, the Common Neighbors, and the Adamic-Adar) between the corresponding node pairs \((v_{i}, v_{j})\), W is the node representation matrix, and C is the embedding of the node as the context. To reconstruct the matrix S more simply and elegantly, HOPE proposes to obtain W and C directly based on the low-rank singular value decomposition (SVD).

Graph kernel The kernel method is an important algorithm in pattern recognition and machine learning. Its basic idea is to give the graph embedding \(x \in X\) in the original low-dimensional space X, and maps the embeddings to a high-dimensional feature space H through a nonlinear function \(f^{ker}\). Then the nonlinear problem in X can be solved by constructing a linear algorithm in H. There are two main types of kernel methods on graph data. The first type uses the embedding method to convert the graph data into vectorial representation, and then directly implements the application based on the kernel function. However, due to the loss of mass graph structure information when transforming graphs into vectorial representation, such methods do not perform well in real scenarios. The second type of method introduces the graph kernel function to solve this problem. Based on retaining the advantages of the original kernel function, it directly represents the structural information of the graph data in the high-dimensional Hilbert space. The definition of the traditional method of graph kernel comes from R-convolution. According to the difference between the contrast substructure and the decomposition method of the graph structure, a large number of methods based on graph kernel have been proposed. For example, the work of [379, 380] proposed a random-walk kernel based on calculating the number of common synchronization between two graph structures, To reduce the computational complexity and optimize the random walk strategy, a graph kernel based on comparing the shortest path information between two graph structures is proposed. To capture more complex topological information, the Weisfeiler-Lehman subtree graph kernel is proposed, which is based on a one-dimensional Weisfeiler-Lehman isomorphism test algorithm to find isomorphic subtree structures in a bunch of graph structures [381].

C PFMs theory

Since pretraining has received great attention from the research community, the investigation in the theory-backed explanation is similarly eye-catching. During the unsupervised pretraining era before SSL, Erhan et al. [382, 383] shed some light on the theoretical explanation for the confirmation and clarity of learning difficulties. [382] researches the influence of pretraining with respect to architecture depth, model capacity, and the number of training samples, and demonstrates the robustness of pretraining from the perspective of both the optimization and the regularization. [383] further prove the regularizer role of the unsupervised pretraining in the downstream supervised tasks.

1.1 C.1 Different perspectives

Pretext tasks. [384] posits a mechanism based on approximate conditional independence (CI) to connect pretext and downstream task data distributions, which suggests that pretext tasks can self-supervisedly learn the representations from unlabelled data that reduce the sample complexity of downstream supervised tasks. The experiments both on CV and NLP tasks support this theory. Representation Learning via Invariant Causal Mechanisms (ReLIC) [181] also provides a theoretical understanding from the perspective that the explicit invariance constraints across augmentations can yield improved generalization guarantees.

Multi-view redundancy. From the perspective of a multi-view setting, [385] understands contrastive learning as exploiting multiple views of data for representation learning. This theory provides a theoretical analysis that the linear functions of these representations from pretraining are still competitive compared with the non-linear optimal predictor of the label. In other words, the linear functions of the learned representations are nearly optimal on downstream prediction tasks whenever the different views provide redundant information about the label.

1.2 C.2 Different categories

Contrastive learning. Although experimental results show us that previous designs such as contrastive loss or momentum updating can produce impressive performance in SSL. However, one of the most important questions that remain in SSL is why these methods can maintain representation consistency during the pretraining process. A naive view is the minimization between positive pairs can boost invariance learning, while the maximization between negative pairs contributes to avoiding representational collapse. [386] shows that contrastive learning can achieve competitive bound via intra-class concentration, thus leading to the reduction of sample complexity on downstream tasks from the benefit of transferred representations. This research also provides a framework that can be utilized both on the guarantees of the quality of learning representations during the pretraining phase and the future assumptions added to the framework that allow tighter guarantees.

Non-contrastive learning. While contrastive learning shows an effect by capturing the similarity and dissimilarity among the unlabelled examples, and further converging to an average local optimum which represents the general representations, recent non-contrastive SSL methods such as BYOL and SimSiam also shows the SOTA performance without the design of comparison between negative pairs. Based on the analysis of the eigenspaces, Tian et al. [182] study the behavior of non-contrastive SSL training and prove that the effects are from both the predictor and stop-gradient signal. Based on this theory, a novel and simple DirectPred method is proposed as a by-product of this theoretical exploration.

D Pretext task taxonomy on CV

Pretext tasks are always designed to use pseudo labels generated from the data itself to pretrain the proxy model. There are five categories of pretext tasks for self-supervised: (1) generation-based methods; (2) transformation-based methods; (3) context-based methods; (4) semantic-based methods; (5) view-based methods.

Generation-based methods This type of method is GAN-based in the deep learning era. For image generation, there are several applications including image colorization [138, 387], image super-resolution [388], image editing [389], context encoders [137], image-to-image translation [352], etc. On the other hand, video generation tasks contains future prediction [145], video action recoginition [241], video generation [390, 391], and video representaion [392].

Transformation-based methods Transformation is a typical technology that serves as a data augmentation method to enlarge the training dataset in traditional DL. However, if transformations of the same image are labeled as positive samples and others as negative samples, this pretext task can be used for self-supervised pretraining [166]. Popular transformation in self-supervised learning (SSL) contains color transformation (such as Jitter, Gaussian blur, and adjusting brightness) and geometric transformation (such as flipping, cropping, scaling, and rotation).

Context-based methods Basically, the design and construction of many artificial tasks, such as solving Jigsaw puzzles [140], comparing context similarity, and discriminating sequence order. Solving Jigsaw puzzles is defined as identifying the correct position of patches from an image. This task can help the model to learn an encoder for transfer learning [141, 393], and the feature representations are effective after the pretrained dataset is big enough. In addition, the design of video Jigsaw is also proposed for unsupervised learning [394]. Differently, context similarity tries to label the patches from the same images as positive samples and others as negative samples, then use a predefined similarity function to scale the distance between different pairs [49].

Semantic-based methods Semantic-based methods contain object detection, semantic segmentation, and depth prediction. These tasks also involve pretext tasks because their pixel-based labels can learn a more robust feature representation than simpler tasks. These pre-text tasks always establish on video dataset [395, 396].

View-based methods This type of method contains both single-modal data and multi-modal data. For the single-modal data, the original data is treated as the anchor and different viewpoints generate its positive pair samples. Sometimes the time slices in sequence-based data are treated as negative pairs because the scene is changed as time goes [397]. In addition, multi-modal data is usual in view-based methods, which are also called cross-modal-based methods here. Such as audio-video cooperative learning [398], RGB and optical flow cross-modal distance training [250].

E Evaluation metrics

Classification task The classification task, according to a labeled training document, determines the relationship between document features and document categories. The learned relationship model is then used to determine the category of new documents.

Accuracy and error rate The key metrics for a text classification model are Accuracy and Error Rate. The terms Accuracy and Error Rate are defined as follows:

$$\begin{aligned} Accuracy =\frac{(\textrm{TP}+\textrm{TN})}{N}, \end{aligned}$$
(20)
$$\begin{aligned} Error Rate = 1 - Accuracy =\frac{(\textrm{FP}+\textrm{FN})}{N}, \end{aligned}$$
(21)

where \(\textrm{TP}\) and \(\textrm{FP}\) denote true positive and false positive, \(\textrm{TN}\) and \(\textrm{FN}\) stand for true negative and false negative.

Precision, Recall and F1 Regardless of the standard type and error rate, there are very important metrics used for unbalanced testing sets. These metrics are similar to the concept of the class label in the testing samples. F1 is defined as the harmonic average of Precision and Recall. Thus, Accuracy, Recall, and F1 can be represented as:

$$\begin{aligned} Precision =\frac{\textrm{TP}}{\textrm{TP}+\textrm{FP}}, \quad Recall =\frac{\textrm{TP}}{\textrm{TP}+\textrm{FN}}, \end{aligned}$$
(22)
$$\begin{aligned} F1 =\frac{2 \mathrm { Precision \times Recall }}{\textrm{ Precision }+\textrm{Recall}}. \end{aligned}$$
(23)

When the accuracy, F1, and recall values hit 1, the desired results are obtained. On the other hand, when the values turn 0, we get the worst consequence. For the multi-class classification task, the precision and recall values of each class can be determined independently, and then the individual and overall performance can be analyzed.

Micro-F1 The Micro-F1 [399] is a metric that measures all labels’ overall accuracy and recall. We denote \(Micro\text {-}F1\) as:

$$\begin{aligned} Micro\text {-}F1=\frac{2 \textrm{P}_{t} \times R_{t}}{\textrm{P}+\textrm{R}}, \end{aligned}$$
(24)
$$\begin{aligned} P=\frac{\sum _{t \in \mathcal {S}} T P_{t}}{\sum _{t \in S} T P_{t}+F P_{t}},\quad R=\frac{\sum _{t \in S} T P_{t}}{\sum _{t \in \mathcal {S}} T P_{t}+F N_{t}}. \end{aligned}$$
(25)

where \(T P_{t}\) and \(F P_{t}\) mean true and false positive of the t th label on a text.

\(Macro\text {-}F1\) The \(Macro\text {-}F1\) calculates the average F1 of all labels by giving equal weight to them. \(Macro\text {-}F1\) is denoted as:

$$\begin{aligned} {Macro}\text {-}F1=\frac{1}{\mathcal {S}} \sum _{t \in \mathcal {S}} \frac{2 \textrm{P}_{t} \times R_{t}}{\mathrm {P_{t}}+\mathrm {R_{t}}}, \end{aligned}$$
(26)
$$\begin{aligned} P_{t}=\frac{T P_{t}}{T P_{t}+F P_{t}},\quad R_{t}=\frac{T P_{t}}{T P_{t}+F N_{t}}. \end{aligned}$$
(27)

where \(T N_{t}\) and \(F N_{t}\) represent true and false negative of the t th label. \(\mathcal {S}\) stands for the label set of all samples.

Mean reciprocal rank (MRR) The MRR is commonly used to evaluate the performance of ranking algorithms on Question Answering (QA) and Information Retrieval (IR) tasks. MRR is represented as

$$\begin{aligned} \textrm{MRR}=\frac{1}{Q} \sum _{i=1}^{Q} \frac{1}{{rank}_{i}}, \end{aligned}$$
(28)

where \({rank}_{i}\) is the ranking of the i th ground-truth answer. The number of predicted labels on each text is denoted by Q. Moreover, there are some metrics, such as EM, Hamming-loss [400], P@K and NDCG@K.

Generation task Generation task uses LMs to predict the next most likely word or sentence based on input data.

Bilingual evaluation understudy (BELU) BLEU compares the generated sentences to the reference sentence and makes predictions using automatic machine translation algorithms. The language creation problem is also supported by deep learning technologies such as speech recognition, image caption generation, and text summarization. They can’t discover anything better, but it has a few advantages: it’s simple to comprehend, correlates well with human judgment, and is language-independent. As a bilingual evaluation aid, BLEU is mainly used to evaluate the quality of machine translation [401]. BLEU compares the degree of overlap between the N-gram in the candidate text and the N-gram in the reference text. The higher overlap indicates better translation quality. The formula for the computation is:

$$\begin{aligned} B L E U=BP \times {\text {exp}}\left( \sum _{n=1}^{N} W_{n} \log P_{n}\right) , \end{aligned}$$
(29)

where N represents N-gram, BP is penalty factor, \(P_{N}\) is multivariate precision, and \(W_{N}=1/N\) is the corresponding weight of multivariate precision. r represents the length of the shortest reference translation, and c represents the length of the candidate translation, then the specific calculation method of penalty factor BP is as follows:

$$\begin{aligned} BP= {\left\{ \begin{array}{ll}1, & l_{t}>l_{a} \\ e^{1-l_{a} / l_{t}}, & l_{t} \le l_{a}\end{array}\right. }, \end{aligned}$$
(30)

where \(l_{t}\) is the number of words in machine translation and \(l_{a}\) is the number of words in reference answer. The penalty factor is mostly used to penalize large gaps between machine and reference translations.

ROUGE (recall-oriented understudy for gisting evaluation) ROUGE stands for N-gram co-occurrence statistics, which are used in automatic evaluation methods. It is expanded on the similarity of N-grams, which means that an N-gram is a subsequence of the main document text in terms of N words. There are four types of ROUGE, including ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S. The first two are commonly used, and the N in rouge-N refers to N-gram, which is calculated similarly to BLEU, except BLEU is based on accuracy, while ROUGE is based on recall. L in ROUGE-L refers to the Longest Common Subsequence, which is calculated as the Longest Common Subsequence between the candidate abstract and the reference abstract. Thus, the longer the length, the higher the score, based on the F value. The calculation formula of ROUGE-N and ROUGE-L is mainly introduced. The calculation formula of ROUGE-N is as follows:

$$\begin{aligned} ROUGE-N=\frac{\sum _{S \in \{\text{ ReferenceSummaries }\}} \sum _{\text{ gram}_{n} \in S} \text{ Count}_{\text{ match }}\left( {\text {gram}}_{n}\right) }{\sum _{S \in \{\text{ ReferenceSummaries }\}} \sum _{\text{ gram}_{n} \in S} \text{ Count }\left( \text{ gram}_{n}\right) }, \end{aligned}$$
(31)

where N stands for N-gram, \(Count(gram_{n})\) represents the frequency of occurrence of an N-gram, and \(Count_{match}(gram_{n})\) represents the frequency of co-occurrence of an N-gram. The calculation formula of ROUGE-L is as follows:

$$\begin{aligned} ROUGE-L=F_{lcs}=\frac{\left( 1+\beta ^{2}\right) R_{\textrm{lcs}} P_{\textrm{lcs}}}{R_{1 \textrm{cs}}+\beta ^{2} P_{\textrm{lcs}}}, \end{aligned}$$
(32)
$$\begin{aligned} R_{\textrm{lcs}}=\frac{L C S(X, Y)}{M}, \end{aligned}$$
(33)
$$\begin{aligned} P_{\text{ lcs } }=\frac{L C S(X, Y)}{N}, \end{aligned}$$
(34)

where X is the candidate abstract, Y represents the reference abstract, LCS(XY) table indicates the length of the Longest Common Subsequence (LCS) of the candidate abstract and references abstract, M stands for the length of reference abstract, and N denotes the length of the candidate abstract. The ROUGE method is characterized by N-gram co-occurrence statistics, based on recall rate (ROUGE-N) and F-value (ROUGE-L). They are often used in text summaries. It is worth noting that ROUGE is word-based correspondence rather than semantic-based correspondence, but this can be mitigated by increasing the number of reference summaries.

METEOR METEOR, also known as an explicitly sorted translation evaluation metric [402], is an improved version of the BLEU standard that aims to address some flaws in the BLEU standard. Using WordNet to calculate matching relationships between specific sequences, synonyms, roots, affixes, and definitions improves BLEU performance and makes it more relevant to manual discrimination. The calculation formula is as follows:

$$\begin{aligned} METEOR=(1-Pen) \times F_{\text{ m }}, \end{aligned}$$
(35)
$$\begin{aligned} F_{\text{ m }}=\frac{P R}{\alpha P+(1-\alpha ) R}, \end{aligned}$$
(36)
$$\begin{aligned} P=\frac{m}{\sum _{k} \text{ h}_{k}(c_{i})}, \end{aligned}$$
(37)
$$\begin{aligned} R=\frac{m}{\sum _{k} \text{ h}_{k}(s_{ij})}, \end{aligned}$$
(38)

where \(Pen=\gamma (\frac{ch}{m})^{\theta }\) is a penalty factor, which punishes the word order in candidate translation that is different from that in reference translation. ch refers to the number of chunks, which are clustered units of matched units adjacent to each other in both the candidate translation and the candidate reference translation. \(\alpha , \beta , \theta\) is the adjustable parameter, m is the number of unary groups that can be matched in the candidate translation, c is the length of the candidate translation, \(h_{k}(c_{i})\) is the number of occurrences in candidate translations \(c_{i}\), and \(h_{k}(s_{ij})\) is the number of occurrences in reference translations \(s_{ij}\).

Perplexity Perplexity is also called the degree of confusion [403]. Its core idea is: first, according to the testing sentence, learn an LM P. Then, according to the LM P, the score of the optional sentence is calculated. Finally, the above scores are standardized according to sentence length. The calculation formula is as follows:

$$\begin{aligned} P P L(W)=P\left( w_{1}, w_{2}, \ldots , w_{M}\right) ^{-\frac{1}{M}}, \end{aligned}$$
(39)

where W is the candidate translation, M is the length of the candidate translation, P is the LM obtained according to the reference translation, and \(P(w_{1}, w_{2}, \ldots , w_{M})\) is the score calculated by the LM for the candidate translation. The Perplexity assessment indicator is based on a LM. The lower the degree of confusion, the better the translation quality, which is often used in machine translation and LMs. Its disadvantages are as follows: the larger the data set is, the faster the degree of confusion decreases; the punctuation in the data will impact the PPL of the model; and the interference of common words.

F Datasets

1.1 F.1 Downstream tasks and datasets on NLP

There are many available data sets in the NLP domain, divided according to different tasks. It mainly comprises two categories: the task of classification of texts and the task of generating texts. The text classification tasks mainly include Sentiment Analysis (SA), News Classification (NC), Topic Labelling (TL), Natural Language Inference (NLI), Named Entity Recognition (NER), Question Answering (QA), Dialogue Act Classification (DAC), etc. The generation tasks mainly include text summaries and machine translation (Table 5).

Table 5 The statistics of the datasets on NLP. For the QA task, the class represents the sum number of candidate answers and the correct answer. For dialogue, class is the number of slots. Length means the average tokens in turn

Sentiment analysis (SA)

It consists of judging the emotional polarity and dividing it into several classes. Depending on the granularity of sentiments, the SA is divided into three categories: dichotomy (positive and negative), trichotomy (positive, negative, and neutral), and multiple categories. Here we introduce several datasets in detail.

Stanford sentiment treebank (SST) [474] The dataset is an extension of MR [475]. SST-1 is a version of SST. It is divided into five categories and the number of training texts and testing texts is 8544 and 2210, respectively. It also consists of 20 average tokens. The SST-2 [409] contains 9613 movie reviews including 6920 training texts, 872 development texts, and 1821 testing texts.

Semantic textual similarity benchmark (STS-B) [476] It is used in semantic textual similarity tasks organized in the SemEval context between 2012 and 2017 [477]. It consists of text from image titles, news titles and forums. On a scale of 1 to 5, STS-B displays the semantic similarity of two sentences. It includes 5749 training sets, 1379 development sets, and 1377 testing sets.

Multi-perspective question answering (MPQA) [478, 479] This is an opinion dataset which has two categories. It contains 10,606 sentences from various news sources that have been manually annotated for opinions and other private states. It is worth noting that there are 3311 positive articles and 7293 negative articles, having no labels for each article.

IMDB reviews [480] The dataset is the world’s most authoritative source for binary sentiment classification of film reviews. The number of content in each class is the same and it can be divided into training and testing sets whose number of comments is 25,000 on average.


News classification (NC)

As one of the most vital information sources, news content exerts a critical effect on people. The NC facilitates users to acquire essential knowledge in real time. Its applications mainly include news topic identification and recommendation of relevant news based on user interests. Here we introduce several datasets in detail.

20 Newsgroups (20NG) [481] 20NG is a text dataset derived from newsgroups. There are 20 classes with the same number of articles per class, including 18,846 articles in total. The average number of tokens is 221.

AG news [424, 482] This is an academic news search engine, which is divided into four categories. It contains news headlines and introductions. It includes 120,000 training texts and 7600 testing texts. The number of average tokens is 45/7.

R8 and R52 [483] They come from Reuters [484]. R8 contains 8 classes consisting of 66 average tokens and includes 2189 and 5485 testing and training courses. There are 52 classes in R52, which consists of 70 average tokens. It is divided into 6532 and 2568 training and testing texts.


Topic labeling (TL)

The task mainly obtains the meaning of the file by defining complex file themes. It is a critical component of topic analysis technology, which aims at simplifying topic analysis by assigning each article to one or more topics. Here, we introduce a few in detail.

DBpedia [485] It is a large-scale multilingual knowledge base generated by Wikipedia’s most commonly used information boxes. It releases DBpedia every month, adding or removing classes and attributes in each version. The most popular version of DBpedia has 14 categories, separated into 560,000 training data and 70,000 testing data. The number of average tokens is 55.

Ohsumed [486] This is a biomedical literature database. The number of texts is 7400. It has 23 cardiovascular disease categories and consists of 136 average tokens. All texts are medical abstracts that are categorized into one or more classes.

Yahoo answers (YahooA) [424] The dataset is a topic labeling task having 10 categories. The number of average tokens is 136. There are 140,000 training data and 5000 testing data. Each text in YahooA has question titles, question contexts, and best answers.


Natural language inference (NLI)

This task is used to forecast whether the meaning of a text can be inferred from another. Interpretation is a broad form of NLI. By comparing the semantic similarity of sentence pairings, it determines whether a sentence is the interpretation of another one. Here we introduce several primary datasets in detail.

The Stanford natural language inference (SNLI) [430] It is commonly used in NLI takes. It contains 570,152 human-annotated sentence pairs, which are annotated with three sorts of relationships: neutral, derived, and conflicting. Multi-genre Natural Language Inference (MNLI) [487] has 3 categories and consists of 430,000 sentence pairs annotated with textual information, which is usually used in textual inference tasks. Question Natural Language Inference (QNLI) [488], whose task with 2 classes is to determine whether a given text pair is a question-answer. Winograd Natural Language Inference (WNLI) [489] which consists of 2 categories is a data set that captures the standard reference information between two paragraphs.

Microsoft research paraphrase (MSRP) [435] The dataset contains sentence pairs for the text-similarity task, including 1725 training and 4076 testing sets. A binary label annotates each pair, discriminating whether they are paraphrases.

Sentences involving compositional knowledge (SICK) [434] It includes nearly 10,000 English sentence pairs, marked with similarity, and the scale range is 1–5. It has neutral, entailment, and contradictory three categories.


Named entity recognition (NER)

This is a fundamental task of NLP to identify people, places, organizations, and other entities in text. It is a crucial primary tool for many NLP tasks, including information extraction, question answering, semantic parsing, machine translation, etc.

CoNLL 2003 [303] It consists of newswire text from the Reuters RCV1 corpus. It contains four different entity types (Location, Organization, Person, and Miscellaneous) and includes 1393 English news articles, and 909 German news articles.

OntoNotes 5.0 [13] The dataset consists of 174,5K English, 900K Chinese, and 300K Arabic text data. It comes from telephone conversations, news agencies, radio news, radio conversations, and blogs. It has 18 entity classes containing 11 types, seven values, and 2,945,000 text data.

MSRA [440] This is a Chinese dataset that is obtained from the news domain. It has three types of entities and is used as a shared task on SIGNAN back in 2006.


Question answering (QA)

There are two types of QA systems: the extraction guidance system and the generation guidance system. The extractive QA can be regarded as a particular case of text classification. Here we detail several datasets.

Microsoft research paraphrase corpus (MRPC) [490] It contains 5800 sentence pairs extracted from Internet news, and the task type is similar to the QQP dataset. Sentence pairs are derived from comments on the same news item and determine whether the two sentences are semantically the same. The assessment criteria were classification accuracy and F1 score.

Stanford question answering dataset (SQuAD) [303] This is a large-scale machine-reading comprehension dataset that contains two tasks. SQuAD 1.1 [488] provides questions and corresponding answers, and the data set contains 100,000 samples in total, while SQuAD 2.0 [491] adds unanswered questions and expands the scale to 150,000.

RACE [492] The dataset has 5 categories, containing nearly 100,000 questions extracted from middle and high school English tests, with corresponding answers given by experts. The average length of RACE text is more significant than 300, which is longer than other reading comprehension data sets (such as SQuAD) sequences.


Dialog act classification (DAC)

The dialogue act is a specific verbal component, which marks the dialogue according to the meaning category of the dialogue. DAC categorizes tags according to the meaning of the dialogue to help understand the speaker’s intentions.

Dialog state tracking challenge 4 (DSTC 4) [451] It belongs to the dialog act classification task and mainly focuses on dialog state tracking on human-human dialogs. It is divided into 89 training classes and contains 24,000 training texts and 6000 test texts.

ICSI meeting recorder dialog act (MRDA) [452] It includes about 75 h of speech from 75 naturally occurring meetings among 53 speakers. The number of categories is 5, and it contains 51,000 training texts, 11,000 test texts, and 11,000 validation texts.

Switchboard dialog act (SwDA) [493] The dataset extends the dialogue behavior label with rounds/discourses. The label summarizes the sentence structure, and relevant and pragmatic information of the relevant turn. The SwDA is split into 43 training classes and includes 1,003,000 training texts, 19,000 test texts, and 112,000 validation texts.


Text summarization

Text summarization is a summary of a given single or multiple documents. It is kept as concise as possible while ensuring that it reflects the critical content of the original document. It can be divided into extractive summarization and generative summarization. Extractive summarization is generated by extracting and splicing the critical sentences in documents. Generative summarization is generated by a model, which summarizes documents according to the required content expressed in documents.

NYT [455] The dataset comes from the corpus annotated by the New York Time. The named entities are annotated using the Stanford NER tool in conjunction with the Freebase knowledge base. It contains 9076 articles, with the remaining 100,834 divided into a training set (96,834 examples) and a validation set (4000 samples).

CNN/daily mail [457] It is used for the passage-based question-answering task, and it is popular in assessing ATS systems. The dataset consists of CNN/Daily Mail news stories paired with multi-sentence human-generated summaries. There are 287,226 training instances, 13,368 validation instances, and 11,490 testing instances in total.

Gigaword [464] This is a data set of English news chapters consisting of nearly 950 pieces. Headlines – stories from multiple sources, including the New York Times – include some articles with a one-sentence, short news feed.


Machine translation (MT)

It refers to the task of translation from one language to another with its semantic equivalence by a computer. There are three categories, rule-based machine translation, statistics-based machine translation, and neural network-based machine translation.

WMT14 [465] It is a grouping of datasets used in the Ninth Workshop on Statistical Machine Translation shared tasks, including a news translation task, a quality estimation task, a metrics task, and a medical text translation task.

WMT16 [466] This dataset is a grouping of datasets used in the First Conference on Machine Translation shared tasks. It has ten shared tasks, including a news translation task, an IT domain translation task, a biomedical translation task, an automatic post-editing task, a metrics task, a quality estimation task, a tuning task, a pronoun translation task, a bilingual document alignment task, and a multimodal translation task.

WMT17 [465] The dataset includes three MT tasks (news, biomedical, and multimodal), an automatic post-editing task, a quality estimation task, a task dedicated to the training of neural MT systems, a task on bandit learning for MT, an automatic post-editing task, and a metrics task.

WMT18 [468] It mainly features six shared tasks: a news translation task, a biomedical translation task, an automatic post-editing task, a metrics task, a quality estimation task, and a multimodal translation task. Participants must evaluate their approaches to the machine translation topic using the standard data sets created for the shared tasks.


Dialogue

As an essential way of man–machine interaction, the dialogue system offers a wide range of applications. The existing dialogue systems can be grouped into task-oriented dialogue systems and non-task-oriented dialogue systems from application scenarios. Among them, the non-task type of conversation system can also be called a chatbot.

DSTC2 [471] This is a multi-round dialogue data set of restaurant reservation fields, including 1612 training data, 506 verification data, and 1117 test data. It allows the user’s goals to change compared to DSTC1. DSTC2 is also richer in terms of the conversation state representation, including the slot value pairs of the user’s targets and the ways to find them.

MWOZ [471] It contains 8420/1000/1000 conversations for training, validation, and test sets, respectively. It contains 30 pairs in seven domains being a multi-domain fully-labeled corpus. Every sample includes a goal, multiple user and agent utterances, and annotations regarding slot values.

Out-of-scope (OOS) [471] The dataset includes 15,100 training, 3100 validation, and 5500 test sets, respectively. It contains 151 intent classes, containing 150 in-scope and one out-of-scope intent. The out-of-scope intent indicates that a user utterance failed to classify to given predefined objectives.

1.2 Downstream tasks and datasets on CV

Table 6 The statistics of the datasets used on downstream tasks

The datasets in CV mainly contain three types from the perspective of tasks: classification, detection, and segmentation. The popular datasets are concluded in Table 6, and some infrequently mentioned datasets in long tails are discussed in the text.


Classification

In this part, we first cover the popular large-scale datasets used frequently in both the pretext and downstream tasks. Then the domain datasets only used for the downstream tasks are unfolded.

MNIST [497] It’s a collection of handwritten digits that includes 60, 000 samples in training and 10, 000 in testing. The images are fixed-size with \(28\times 28\) pixels. The pixel values are from 0 to 255.0 in which pixel values smaller than 255.0 can be understood as background (white) and 255 means foreground (black). The labels are from 0 to 9 and only one of these digits exists in an image. Both traditional and deep learning methods are based on this most popular dataset despite advanced methods showing perfect results. Thus, Geoffrey Hinton has described it as “the drosophila of machine learning”.

Street view house numbers (SVHN) [498] In the domain of digit numbers, it collects real-world digit numbers from house numbers in Google Street View images. It includes 73, 257 digits for training, 26, 032 digits for testing, and 531, 131 additional. All of them are \(32\times 32\) color images with both class labels and character-level bounding boxes.

CIFAR [499] As more advanced methods show perfect results on the simple datasets, more sophisticated datasets such as CIFAR-10 and CIFAR-100 are conducted. These two datasets are closer to the real-world object. The CIFAR-10 contains 50,000 training images and 10,000 testing images, with 6000 images per class and \(32\times 32\) pixels in each RGB color image. The CIFAR-100 is similar to the CIFAR-10 but with more detailed label information. There are 100 classes containing 500 training images and 100 testing images in each class. In addition, these 100 “fine” classes are grouped equally into 20 “coarse” classes. Researchers can adapt it to suitable learning methods.

STL-10 [500] Inspired by the CIFAR-10 dataset, STL-10 is another \(96\times 96\) color image dataset containing similar 10 real-world classes. Each class has 500 training images and 800 testing images. The biggest difference is that STL-10 has 100, 000 unlabeled images for unsupervised learning. More construction information can be seen in [501].

Caltech-101 [502] It collects roughly \(300\times 200\) color images of objects belonging to 101 categories, with 40–800 images per category and 50 on average. The outlines of the objects in the pictures are annotated for the convenience of different learning methods.

ImageNet [312] This is one of the most popular and large-scale datasets on computer vision. It is built according to the hierarchical structure of WordNet [503]. The full ImageNet dataset contains 14, 197, 122 images and 21, 841 synsets indexed, attaching on average 1000 images to demonstrate each synset. The most frequently-used subset of ImageNet is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset from 2010 to 2017, containing tasks of classification, localization, and detection. The number of samples in training and testing datasets and the labels of images are determined by the specific task, more details are seen in [504].

HMDB51 [505, 506] In addition to the popular MNIST, there still exist many domain datasets used for the downstream tasks in the classification problem. HMDB51 is an action video database for a total of 7000 clips in 51 action classes. It contains five types of facial actions and body movements.

UCF101 [507] It is another action video data set designed for more realistic action recognition. It is an extension of the UCF50 [508] data set containing only 50 action categories with 101 action categories, collected from YouTube. What makes it a famous recognition dataset is the workshop in ICCV13 with UCF101 as its main competition benchmark.

Food-101 [509] This is a real-world food dataset of 101 food categories, with 750 and 250 images per class in training and testing dataset respectively.

Birdsnap [510] It is a fine-grained visual categorization of birds on a broad scale, with bounding boxes and the locations/annotations of 17 parts in the object. It contains 49,829 images of the 500 most common species in North America, with each species containing 69–100 images and most species having 100. In addition, some images are also labeled as male or female, immature or adult, and breeding or non-breeding plumage.

SUN397 To target the scene categorization, the extensive Scene UNderstanding (SUN) database [511, 512] fills the gap of the existing dataset with the limited scope of categories. This database has 899 categories and 130,519 images, and only images with more than \(200\times 200\) pixels were kept. SUN397 is a more well-sampled subset that maintains 397 categories with at least 100 images per category, in which other categories containing relatively few unique photographs are discarded.

Places205 Places205 [513] dataset is another large scale scene dataset consists of 2,448,873 images from 205 scene categories.

Cars [514] The dataset in the domain of cars contains 16,185 color images of 196 classes (at the level of Make, Model, Year) of cars. For convenience, this dataset is split into training and testing sets in roughly equal quantities.

Aircraft [515] It is another fine-grained visual classification designed for aircraft (also known as FGVC-Aircraft). A popular form of this dataset is the fine-grained recognition challenge 2013 (FGComp2013) [516] ran in parallel with the ILSVRC2013. There exist four-level hierarchies: Model, Variant, Family, Manufacturer, from finer to coarser to organize this database. The more detailed information is shown in [517].

Pets [518] It represents The Oxford-IIIT Pet Dataset that collects 37 pet categories with roughly 200 images per category. All images have an associated ground truth annotation of breed for classification, head ROI for detection, and pixel-level trimap for segmentation.

Flowers [519] Similarly, Flowers is another domain dataset in flowers also collected by Oxford; it contains Oxford-17 Flowers of 17 categories and Oxford-102 Flowers of 102 categories.

Describable textures dataset (DTD) [520] This is an evolving collection of textural images in the wild, which consists of 5640 images of 47 categories, with 120 images per category.

iNaturalist2018 [521] It is a large-scale species classification competition conducted on the FGVC5 workshop at CVPR2018. This dataset contains over 8000 species categories, with more than 450, 000 images in the training and validation dataset collected from iNaturalist [522].

JFT-300 M [523] JFT-300 M is an internal Google dataset introduced by Sun et al [523] and well-known from ViT Model [40]. It is labeled by algorithms that utilize human-computer communications and target classification tasks. This dataset finally contains 300 M images with over 1000 M labels, thus leading to the multiple labels attached to this large-scale dataset.


Detection

The detection is a popular task in the CV, and almost all the research is conducted on COCO and PASCAL VOC datasets.

COCO [339] This is a large-scale data set for object detection, segmentation, and caption; it contains 330, 000 RGB images, with more than 200, 000 labeled. There are 1.5 million object instances of 80 object categories involved. Thus, it is one of the most popular benchmark data set in detection and segmentation in parallel with the following PASCAL VOC.

PASCAL VOC [524] From 2005 through 2012, the dataset has run challenges assessing performance on object class recognition and has provided standardized image datasets for object class recognition. The main datasets used in self-supervised learning are VOC07, VOC11, and VOC12. Main competitions in VOC07 [525] contain classification and detection tasks; both of them consist of 20 objects and contain at least one object in each image. Thus, it is common to use VOC07 to serve as the downstream task for the detection.


Segmentation

The segmentation is a semantics-based pixel-level classification. These datasets are difficult to obtain and annotate, thus they are always used as a downstream task.

VOC11 [526] & VOC12 [527] Both VOC11 and VOC12 contains classification, detection, and segmentation tasks in the main competition, thus leading to the common use of downstream task for the segmentation.

ADE20K [528, 529] It collects 27, 574 images from both the SUN and Places205 databases, in which 25, 574 for training and 2000 for testing. All 707, 868 objects from 3688 categories existing in images are annotated. Especially, this dataset contains 193, 238 annotated object parts and parts of parts, and additional attributes, annotation time, depth ordering for the benefit of the research community.

NYU-Depth V2 [530] This is a dataset consisting of images and video sequences from 464 indoor scenes that are recorded by both the RGB and Depth cameras from 3 cities. It contains 1449 images with the ground truth of depth, and the original RGB values are also provided. In addition, there are 407, 024 new unlabeled frames and additional class labels for the objects in images.

Cityscapes [531, 532] It is a dataset of urban street scenes from 50 cities with the ground truth of semantic segmentation. The main instances are vehicles, people, and construction. The high-quality dense pixel annotations contain a volume of 5000 images. In addition to the fine annotations, coarser polygonal annotations are provided for a set of 20, 000 images. Moreover, the videos consist of not consistent images with high-quality annotations, and these annotated images with consistently changing views are provided for researchers.

LVIS [533] It is a dataset for large vocabulary instance segmentation. It features that (1) a category or word in one image is related to the only segmentation object; (2) more than 1200 categories are extracted from roughly 160,000 images; (3) long tails phenomenon exist in these categories; and (4) more than 2,000,000 high-quality instance segmentation masks.

Densely annotated video segmentation (DAVIS) [534] It is a video dataset designed for the in-depth analysis of the SOTA in video object segmentation, in which DAVIS 2017 [535] contains both semi-supervised (human-guided at the testing time) and unsupervised (human non-guided at test time) video sequences with multiple annotated instances.


Others

There are many datasets designed for special visual tasks such as inpainting. In addition, this part covers the data collection in the wild.

Paris StreetView [536] The dataset is designed for image inpainting task, which contains 14, 900 training images and 100 testing images collected from street views of Paris. This dataset is collected from Google Street View and mainly focuses on the buildings in the city.

Moving-MNIST [537] Based on MNIST, it is a video dataset designed for evaluating sequence prediction or reconstruction, which contains 10,000 sequences. Each video is long of 20 frames and consisted of two digits (possibly overlapped) moving inside a \(64\times 64\) patch. The first benchmark is reported on [538] by the method of LSTMs.

Yahoo Flickr creative commons 100 million (YFCC100M) [539, 540] The dataset is the largest public multimedia collection that is allowed to search by users for their own targets; this dataset can browse both images and videos. It is free and for researchers to explore and investigate subsets of the YFCC100M in real time. Subsets of the complete dataset can be retrieved by any keyword search and reviewed directly. In addition, the text information attached to any image or video is abundant, such as containing location information and user tags. Briefly, it is more a multimedia library than a domain dataset.

Data in the wild More generalized dataset concept in the self-supervised learning era is composed of multimedia websites, APP, or search engines such as Instagram, Flickr, Google Images, etc. I think pictures in the wild will play a major role in the future study of CV because of the quantity of data, the computation source, and the learning power of PFM.

1.3 F.3 Downstream tasks and datasets on graph

The purpose of the pretraining graph model is to improve the performance of downstream tasks. According to the different analysis objects of the downstream tasks, they can be divided into nodes, edges, and graphs. Meanwhile, the PFMs of GL have been widely used in a mass of fields. In this section, we combine the downstream tasks to conduct statistics on the pretraining datasets and the downstream task datasets.


Node-level tasks

Nodes are the most basic element of the graph, so lots of downstream tasks mainly focus on the analysis of nodes.

Node classification Node ClassiFication (NCF) is one of the most prevalent graph-based tasks, which has important analytical value in most of the different types of graph data. Different from the pseudo-labels assigned to nodes in the graph in self-supervised methods, the labels in NCF often come from external information such as manual annotation. Based on Definition 7 and 8, NCF can be divided into two types: transductive and inductive according to the visibility during training, verification, and testing. In addition, the result of NCF can be single-label or multi-label according to the mutual exclusion of labels. The statistical results of common NFC datasets are shown in Table 7.

Table 7 The statistics of the datasets for node-level tasks. Homogeneous:Hom, Heterogeneous:Het

Node clustering The goal of Node ClusterIng (NCI) is to divide a graph into different classes or clusters according to a certain standard so that the correlation of nodes in the same cluster is as large as possible, and the irrelevance of nodes that are not in the same cluster is also minimized. Although in the above-mentioned pretraining tasks, NCI is used as a pretext task has appeared, NCI can still test pretraining graph models based on other pretext tasks.

Top-K search The goal of task Top-K Search (TKS) is to search the K nodes with the highest predefined associations for a given node in the graph. Usually, TKS is used for search tasks such as recommendation and alignment. The detailed statistical results of the datasets are shown in Table 7.


Link-level tasks

The edge is also an important part of the graph structure, which associates independent nodes and is the key to distinguishing graph data from non-relational data. Especially in some specific fields (e.g., molecules, proteins), edges contain real information, so there are various tasks related to edges.

Link classification Similar to the NCF, the Link Classification (LC) also assigns one or more labels to a given edge. In fact, in LC, the nodes at both ends of the edge are still taken into consideration.

Link prediction Link Prediction (LP) is a common graph task (e.g., knowledge graph). The goal of LP is to predict edges that are removed or may exist in the graph. Similar to NCI, LP is also one of the pretext tasks in self-supervised learning, and its statistic results as shown in Table 8.

Table 8 The statistics of the datasets for LC. Homogeneous:Hom, Heterogeneous:Het

Top-K recommendation Top-K Recommendation (TKR) is exactly the same as the definition of TKS, the difference lies in the sorting goal.


Graph-level tasks

The graph-level task generally focuses on the distribution of nodes, edges, and attributes in a given graph, in order to infer the possible properties of the entire graph.

Graph classification Graph Classification (GC) is commonly used in social, molecular, and protein graph data, which aims to predict the property of the given community, chemical compound, and protein. The statistic results as shown in Table 9.

Table 9 The statistics of the datasets for GC. Homogeneous:Hom, Heterogeneous:Het

Data source

The PFMs of GL have been widely used in a mass of fields. We will descript the details of the pretraining datasets and the downstream task datasets.

Citation and co-author network A citation is a basic local representation, whose structure reflects the citation relationships of papers in a research direction or field. Specifically, a citation network is a kind of relational data composed of research papers as nodes and citation relations as edges. Among them, the citation network used in the GL model usually comes from local samples of common citation databases, e.g., Cora, Citeseer, and PubMed, and serves as downstream tasks. Similarly, the co-author network is a dataset of scientific collaboration that corresponds to a researcher’s ego network, in which the researcher and their collaborators are nodes and an edge indicates collaboration between two researchers. According to different requirements of downstream tasks, such co-author networks can be used for various tasks, e.g., node classification and graph classification.

Molecular and protein network A molecular network usually refers to a compound composed of atoms and atomic bonds, and predicting the properties of the compound is usually regarded as a graph classification task. For example, MUTAG is a collection of nitroaromatic compounds whose goal is to predict their mutagenicity to Salmonella typhimurium. PTC uses a graph to show the structure of multiple compounds and aims to predict the carcinogenicity of different compounds in rats. The protein network is a collection of proteins classified as either enzymes or non-enzymes. The amino acids are represented by nodes, and two nodes are connected by an edge if they are less than 6 Angstroms apart.

Social and movie network The social network is the social-relational data in the real network environment, which usually represents the relationship between users or posts. For instance, Reddit is a graph dataset comprised of Reddit posts made in September 2014. BlogCatalog is a graph dataset that represents a network of social relationships between bloggers who are listed on the BlogCatalog website. The movie network is usually composed of actors and their co-occurrence participation in the movie. For example, IMDB-B is a movie collaboration data set that contains a large number of self-networks of actors who play movie roles in IMDB. Nodes in each graph represent actors/actresses, and if they appear in the same film, an edge connects them. These graphs are based on action and romance genres. The difference between IMDB-M and IMDB-B is that a node in the graph represents one or more actors.

Others Some of the rarer graph data are used to test the universality of the PFM, such as word networks (Wikipedia), book networks (Book-crossing), and airline networks (US-Airport). In addition, there are also some special graph structures adapted to specific models, such as spatiotemporal graphs (METR-LA).

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Zhou, C., Li, Q., Li, C. et al. A comprehensive survey on pretrained foundation models: a history from BERT to ChatGPT. Int. J. Mach. Learn. & Cyber. (2024). https://doi.org/10.1007/s13042-024-02443-6

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s13042-024-02443-6

Keywords