A comprehensive survey on pretrained foundation models: a history from BERT to ChatGPT

Ce Zhou¹^na1,
Qian Li²^na1,
Chen Li²^na1,
Jun Yu³^na1,
Yixin Liu³^na1,
Guangjing Wang¹,
Kai Zhang³,
Cheng Ji²,
Qiben Yan¹,
Lifang He³,
Hao Peng²,
Jianxin Li²,
Jia Wu⁴,
Ziwei Liu⁵,
Pengtao Xie⁶,
Caiming Xiong⁷,
Jian Pei⁸,
Philip S. Yu⁹ &
…
Lichao Sun³

5077 Accesses
Explore all metrics

Abstract

Pretrained Foundation Models (PFMs) are regarded as the foundation for various downstream tasks across different data modalities. A PFM (e.g., BERT, ChatGPT, GPT-4) is trained on large-scale data, providing a solid parameter initialization for a wide range of downstream applications. In contrast to earlier methods that use convolution and recurrent modules for feature extraction, BERT learns bidirectional encoder representations from Transformers, trained on large datasets as contextual language models. Similarly, the Generative Pretrained Transformer (GPT) method employs Transformers as feature extractors and is trained on large datasets using an autoregressive paradigm. Recently, ChatGPT has demonstrated significant success in large language models, utilizing autoregressive language models with zero-shot or few-shot prompting. The remarkable success of PFMs has driven significant breakthroughs in AI, leading to numerous studies proposing various methods, datasets, and evaluation metrics, which increases the demand for an updated survey. This study provides a comprehensive review of recent research advancements, challenges, and opportunities for PFMs in text, image, graph, and other data modalities. It covers the basic components and existing pretraining methods used in natural language processing, computer vision, and graph learning, while also exploring advanced PFMs for different data modalities and unified PFMs that address data quality and quantity. Additionally, the review discusses key aspects such as model efficiency, security, and privacy, and provides insights into future research directions and challenges in PFMs. Overall, this survey aims to shed light on the research of the PFMs on scalability, security, logical reasoning ability, cross-domain learning ability, and user-friendly interactive ability for artificial general intelligence.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

£29.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price includes VAT (United Kingdom)

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1

Fig. 3

Fig. 4

Fig. 5

Fig. 10

Transformers in the Service of Description Logic-Based Contexts

Enhancing Unsupervised Pretraining with External Knowledge for Natural Language Inference

Large language models (LLMs): survey, technical frameworks, and future challenges

Article Open access 18 August 2024

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Data availability

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Notes

References

Bommasani R, Hudson DA, Adeli E, Altman R, Arora S, Arx S, Bernstein MS, Bohg J, Bosselut A, Brunskill E et al (2021) On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258
Chowdhury GG (2003) Natural language processing. Annu Rev Inf Sci Technol 37:51–89
Article Google Scholar
Forsyth D, Ponce J (2011) Computer vision: a modern approach. University of Illinois at Urbana-Champaign, USA
Google Scholar
Bondy JA, Murty USR et al (1976) Graph theory with applications. Macmillan, London
Book Google Scholar
Qiu X, Sun T, Xu Y, Shao Y, Dai N, Huang X (2020) Pre-trained models for natural language processing: a survey. Sci China Technol Sci 63(10):1872–1897
Li J, Tang T, Zhao WX, Wen J-R (2021) Pretrained language models for text generation: a survey
Han K, Wang Y, Chen H, Chen X, Guo J, Liu Z, Tang Y, Xiao A, Xu C, Xu Y et al (2020) A survey on visual transformer. arXiv
Sanchez S, Romero H, Morales A (2020) A review: comparison of performance metrics of pretrained models for object detection using the tensorflow framework. In: IOP conference series: materials science and engineering
Hu W, Liu B, Gomes J, Zitnik M, Liang P, Pande V, Leskovec J (2019) Pre-training graph neural networks. arXiv
Zhuang F, Qi Z, Duan K, Xi D, Zhu Y, Zhu H, Xiong H, He Q (2020) A comprehensive survey on transfer learning. In: Proceedings of the IEEE
Bengio Y, Ducharme R, Vincent P, Janvin C (2000) A neural probabilistic language model. Adv Neural Inf Proc Syst
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. In: 1st international conference on learning representations, ICLR 2013. Workshop track proceedings, Scottsdale, Arizona, USA, May 2–4, 2013
Devlin J, Chang M, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT
Yang Z, Dai Z, Yang Y, Carbonell JG, Salakhutdinov R, Le QV (2019) Xlnet: generalized autoregressive pretraining for language understanding. In: NeurIPS
Chen M, Tworek J, Jun H, Yuan Q, Pinto HPdO, Kaplan J, Edwards H, Burda Y, Joseph N, Brockman G et al (2021) Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374
Neelakantan A, Xu T, Puri R, Radford A, Han JM, Tworek J, Yuan Q, Tezak N, Kim JW, Hallacy C et al (2022) Text and code embeddings by contrastive pre-training. arXiv preprint arXiv:2201.10005
Christiano PF, Leike J, Brown T, Martic M, Legg S, Amodei D (2017) Deep reinforcement learning from human preferences. Adv Neural Inf Process Syst
Stiennon N, Ouyang L, Wu J, Ziegler D, Lowe R, Voss C, Radford A, Amodei D, Christiano PF (2020) Learning to summarize with human feedback. Adv Neural Inf Process Syst 33:3008–3021
Google Scholar
Ouyang L, Wu J, Jiang X, Almeida D, Wainwright CL, Mishkin P, Zhang C, Agarwal S, Slama K, Ray A., et al. (2022) Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155
Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A et al (2020) Language models are few-shot learners. arXiv
Lester B, Al-Rfou R, Constant N (2021) The power of scale for parameter-efficient prompt tuning. In: Proceedings of the 2021 conference on empirical methods in natural language processing, pp 3045–3059
Schick T, Schütze H (2021) Exploiting cloze-questions for few-shot text classification and natural language inference. In: Proceedings of the 16th conference of the European chapter of the association for computational linguistics: main volume, pp 255–269
Zhang Z, Zhang A, Li M, Smola A (2023) Automatic chain of thought prompting in large language models. In: International conference on learning representations
Wei J, Wang X, Schuurmans D, Bosma M, Xia F, Chi EH, Le QV, Zhou D et al (2022) Chain-of-thought prompting elicits reasoning in large language models. Adv Neu Inform Process Syst 35:24824–24837
OpenAI (2023) GPT-4 technical report
Wang P, Yang A, Men R, Lin J, Bai S, Li Z, Ma J, Zhou C, Zhou J, Yang H (2022) Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. arXiv preprint arXiv:2202.03052
Lu J, Clark C, Zellers R, Mottaghi R, Kembhavi A (2022) Unified-io: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916
Singh A, Hu R, Goswami V, Couairon G, Galuba W, Rohrbach M, Kiela D (2022) Flava: a foundational language and vision alignment model. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 15638–15650
Wang W, Bao H, Dong L, Bjorck J, Peng Z, Liu Q, Aggarwal K, Mohammed OK, Singhal S, Som S et al (2022) Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442
Clark K, Luong M, Le QV, Manning CD (2020) ELECTRA: pre-training text encoders as discriminators rather than generators. In: ICLR
Wallace E, Rodriguez P, Feng S, Yamada I, Boyd-Graber J (2019) Trick me if you can: human-in-the-loop generation of adversarial examples for question answering. Trans Assoc Comput Linguist 7:387–401
Nie Y, Williams A, Dinan E, Bansal M, Weston J, Kiela D (2020) Adversarial NLI: a new benchmark for natural language understanding. In: ACL
Niven T, Kao H (2019) Probing neural network comprehension of natural language arguments. In: ACL
Wang G, Ivanov N, Chen B, Wang Q, Yan Q (2023) Graph learning for interactive threat detection in heterogeneous smart home rule data. In: 2023 ACM SIGMOD international conference on management of data. ACM
Gordon MA, Duh K, Andrews N (2020) Compressing BERT: studying the effects of weight pruning on transfer learning. In: RepL4NLP@ACL . https://doi.org/10.18653/v1/2020.repl4nlp-1.18
Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R (2020) ALBERT: a lite BERT for self-supervised learning of language representations. In: ICLR
Han X, Zhang Z, Ding N, Gu Y, Liu X, Huo Y, Qiu J, Zhang L, Han W, Huang M et al (2021) Pre-trained models: past, present and future. AI Open 2:225–250
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. arXiv
Guo M-H, Xu T-X, Liu J-J, Liu Z-N, Jiang P-T, Mu T-J, Zhang S-H, Martin RR, Cheng M-M, Hu S-M (2022) Attention mechanisms in computer vision: a survey. Comput Visual Media 8(3):331–368
Article Google Scholar
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
Yun S, Jeong M, Kim R, Kang J, Kim HJ (2019) Graph transformer networks. Adv Neural Inf Process Syst 32
Dehghani M, Djolonga J, Mustafa B, Padlewski P, Heek J, Gilmer J, Steiner AP, Caron M, Geirhos R, Alabdulmohsin I et al (2023) Scaling vision transformers to 22 billion parameters. In: International conference on machine learning. PMLR, pp 7480–7512
Chowdhery A, Narang S, Devlin J, Bosma M, Mishra G, Roberts A, Barham P, Chung HW, Sutton C, Gehrmann S et al (2022) Palm: scaling language modeling with pathways. arXiv preprint arXiv:2204.02311
Joshi M, Chen D, Liu Y, Weld DS, Zettlemoyer L, Levy O (2020) Spanbert: improving pre-training by representing and predicting spans. Trans Assoc Comput Linguist 8:64–77
Lewis M, Liu Y, Goyal N, Ghazvininejad M, Mohamed A, Levy O, Stoyanov V, Zettlemoyer L (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: ACL. https://doi.org/10.18653/v1/2020.acl-main.703
Clark K, Luong M-T, Le QV, Manning CD (2020) Electra: pre-training text encoders as discriminators rather than generators. arXiv
Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R (2019) Albert: a lite bert for self-supervised learning of language representations. arXiv
Donahue J, Krähenbühl P, Darrell T (2016) Adversarial feature learning. arXiv
Caron M, Bojanowski P, Joulin A, Douze M (2018) Deep clustering for unsupervised learning of visual features. In: ECCV
Radford A, Narasimhan K, Salimans T, Sutskever I (2018) Improving language understanding by generative pre-training
Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I (2019) Language models are unsupervised multitask learners. OpenAI blog
Caruana R (1997) Multitask learning. Mach Learn 28:41–75
Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. arXiv
Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, Krikun M, Cao Y, Gao Q, Macherey K et al (2016) Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv
Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) Roberta: A robustly optimized BERT pretraining approach. CoRR arXiv:1907.11692
Sennrich R, Haddow B, Birch A (2015) Neural machine translation of rare words with subword units. arXiv
Song K, Tan X, Qin T, Lu J, Liu T (2020) Mpnet: masked and permuted pre-training for language understanding. In: NeurIPS
Li Q, Peng H, Li J, Xia C, Yang R, Sun L, Yu PS, He L (2022) A survey on text classification: from traditional to deep learning. ACM Trans Intell Syst Technol (TIST) 13(2):1–41
Google Scholar
Song K, Tan X, Qin T, Lu J, Liu T-Y (2019) Mass: masked sequence to sequence pre-training for language generation. arXiv
Dong L, Yang N, Wang W, Wei F, Liu X, Wang Y, Gao J, Zhou M, Hon H-W (2019) Unified language model pre-training for natural language understanding and generation. arXiv
Sun Y, Wang S, Li Y, Feng S, Chen X, Zhang H, Tian X, Zhu D, Tian H, Wu H (2019) Ernie: enhanced representation through knowledge integration. arXiv
Sun Y, Wang S, Li Y, Feng S, Tian H, Wu H, Wang H (2020) Ernie 2.0: a continual pre-training framework for language understanding. In: AAAI
Cui Y, Che W, Liu T, Qin B, Yang Z (2021) Pre-training with whole word masking for Chinese BERT. T-ASL. https://doi.org/10.1109/TASLP.2021.3124365
Article Google Scholar
Diao S, Bai J, Song Y, Zhang T, Wang Y (2020) ZEN: pre-training chinese text encoder enhanced by n-gram representations. In: EMNLP. https://doi.org/10.18653/v1/2020.findings-emnlp.425
Tsai H, Riesa J, Johnson M, Arivazhagan N, Li X, Archer A (2019) Small and practical bert models for sequence labeling. arXiv
Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347
Thoppilan R, De Freitas D, Hall J, Shazeer N, Kulshreshtha A, Cheng H-T, Jin A, Bos T, Baker L, Du Y et al (2022) Lamda: language models for dialog applications. arXiv preprint arXiv:2201.08239
Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013) Distributed representations of words and phrases and their compositionality. arXiv
Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: EMNLP
Dai AM, Le QV (2015) Semi-supervised sequence learning. arXiv
Liu P, Qiu X, Huang X (2016) Recurrent neural network for text classification with multi-task learning. arXiv
Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. TACL
McCann B, Bradbury J, Xiong C, Socher R (2017) Learned in translation: Contextualized word vectors. arXiv
Dai Z, Yang Z, Yang Y, Carbonell J, Le QV, Salakhutdinov R (2019) Transformer-xl: attentive language models beyond a fixed-length context. arXiv
Kong L, d’Autume CdM, Ling W, Yu L, Dai Z, Yogatama D (2019) A mutual information maximization perspective of language representation learning. arXiv
Wang W, Bi B, Yan M, Wu C, Bao Z, Xia J, Peng L, Si L (2019) Structbert: incorporating language structures into pre-training for deep language understanding. arXiv
Xiong W, Du J, Wang WY, Stoyanov V (2019) Pretrained encyclopedia: weakly supervised knowledge-pretrained language model. arXiv
Peters ME, Neumann M, Logan IV RL, Schwartz R, Joshi V, Singh S, Smith NA (2019) Knowledge enhanced contextual word representations. arXiv
Huang H, Liang Y, Duan N, Gong M, Shou L, Jiang D, Zhou M (2019) Unicoder: a universal language encoder by pre-training with multiple cross-lingual tasks. arXiv
Eisenschlos JM, Ruder S, Czapla P, Kardas M, Gugger S, Howard J (2019) Multifit: efficient multi-lingual language model fine-tuning. arXiv
Beltagy I, Lo K, Cohan A (2019) Scibert: a pretrained language model for scientific text. arXiv
Sun S, Cheng Y, Gan Z, Liu J (2019) Patient knowledge distillation for bert model compression. arXiv
Lample G, Conneau A (2019) Cross-lingual language model pretraining. arXiv
Zafrir O, Boudoukh G, Izsak P, Wasserblat M (2019) Q8bert: quantized 8bit bert. In: 2019 Fifth workshop on energy efficient machine learning and cognitive computing-NeurIPS edition (EMC2-NIPS). IEEE, pp 36–39
Sanh V, Debut L, Chaumond J, Wolf T (2019) Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv
Liu W, Zhou P, Zhao Z, Wang Z, Deng H, Ju Q (2020) Fastbert: a self-distilling bert with adaptive inference time. arXiv
Martin L, Müller B, Suárez PJO, Dupont Y, Romary L, Clergerie É, Seddah D, Sagot B (2020) Camembert: a tasty french language model. In: ACL. https://doi.org/10.18653/v1/2020.acl-main.645
Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F, Grave E, Ott M, Zettlemoyer L, Stoyanov V (2020) Unsupervised cross-lingual representation learning at scale. In: ACL. https://doi.org/10.18653/v1/2020.acl-main.747
Kitaev N, Kaiser L, Levskaya A (2020) Reformer: the efficient transformer. In: ICLR
Shen S, Dong Z, Ye J, Ma L, Yao Z, Gholami A, Mahoney MW, Keutzer K (2020) Q-bert: Hessian based ultra low precision quantization of bert. In: AAAI
Chi Z, Dong L, Wei F, Wang W, Mao X-L, Huang H (2020) Cross-lingual natural language generation via pre-training. In: AAAI
Liu W, Zhou P, Zhao Z, Wang Z, Ju Q, Deng H, Wang P (2020) K-bert: enabling language representation with knowledge graph. In: AAAI
Jiang Z, Yu W, Zhou D, Chen Y, Feng J, Yan S (2020) Convbert: improving BERT with span-based dynamic convolution. In: NeurIPS
Wang W, Wei F, Dong L, Bao H, Yang N, Zhou M (2020) Minilm: deep self-attention distillation for task-agnostic compression of pre-trained transformers. In: NeurIPS
Liu Y, Gu J, Goyal N, Li X, Edunov S, Ghazvininejad M, Lewis M, Zettlemoyer L (2020) Multilingual denoising pre-training for neural machine translation. Trans Assoc Comput Linguist 8:726–742
Sun T, Shao Y, Qiu X, Guo Q, Hu Y, Huang X, Zhang Z (2020) Colake: contextualized language and knowledge embedding. In: COLING. https://doi.org/10.18653/v1/2020.coling-main.327
Le H, Vial L, Frej J, Segonne V, Coavoux M, Lecouteux B, Allauzen A, Crabbé B, Besacier L, Schwab D (2020) Flaubert: unsupervised language model pre-training for French. In: LREC
Shen T, Mao Y, He P, Long G, Trischler A, Chen W (2020) Exploiting structured knowledge in text via graph-guided representation learning. In: EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.722
Jiao X, Yin Y, Shang L, Jiang X, Chen X, Li L, Wang F, Liu Q (2020) Tinybert: distilling BERT for natural language understanding. In: EMNLP. https://doi.org/10.18653/v1/2020.findings-emnlp.372
Delobelle P, Winters T, Berendt B (2020) Robbert: a dutch roberta-based language model. In: EMNLP. https://doi.org/10.18653/v1/2020.findings-emnlp.292
He B, Zhou D, Xiao J, Jiang X, Liu Q, Yuan NJ, Xu T (2020) Integrating graph contextualized knowledge into pre-trained language models. In: EMNLP. https://doi.org/10.18653/v1/2020.findings-emnlp.207
Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y, Li W, Liu PJ (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res 21(140):1–67
Schick T, Schütze H (2021) Exploiting cloze-questions for few-shot text classification and natural language inference. In: EACL
Wang X, Gao T, Zhu Z, Zhang Z, Liu Z, Li J, Tang J (2021) KEPLER: a unified model for knowledge embedding and pre-trained language representation. Trans Assoc Comput Linguist 9:176–194
Gao T, Yao X, Chen D (2021) Simcse: simple contrastive learning of sentence embeddings. CoRR arXiv:2104.08821
Du N, Huang Y, Dai AM, Tong S, Lepikhin D, Xu Y, Krikun M, Zhou Y, Yu AW, Firat O et al (2022) Glam: efficient scaling of language models with mixture-of-experts. In: International conference on machine learning. PMLR, pp 5547–5569
Chi Z, Huang S, Dong L, Ma S, Singhal S, Bajaj P, Song X, Wei F (2021) Xlm-e: cross-lingual language model pre-training via electra. arXiv preprint arXiv:2106.16138
Sanh V, Webson A, Raffel C, Bach SH, Sutawika L, Alyafeai Z, Chaffin A, Stiegler A, Scao TL, Raja A et al (2021) Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207
Rae JW, Borgeaud S, Cai T, Millican K, Hoffmann J, Song F, Aslanides J, Henderson S, Ring R, Young S et al (2021) Scaling language models: methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446
Smith S, Patwary M, Norick B, LeGresley P, Rajbhandari S, Casper J, Liu Z, Prabhumoye S, Zerveas G, Korthikanti V et al (2022) Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990
Hoffmann J, Borgeaud S, Mensch A, Buchatskaya E, Cai T, Rutherford E, Casas DL, Hendricks LA, Welbl J, Clark A et al (2022) Training compute-optimal large language models. arXiv preprint arXiv:2203.15556
Zhang S, Roller S, Goyal N, Artetxe M, Chen M, Chen S, Dewan C, Diab M, Li X, Lin XV et al (2022) Opt: open pre-trained transformer language models. arXiv preprint arXiv:2205.01068
Wei J, Bosma M, Zhao V, Guu K, Yu AW, Lester B, Du N, Dai AM, Le QV (2022) Finetuned language models are zero-shot learners. In: International conference on learning representations
Honovich O, Scialom T, Levy O, Schick T (2022) Unnatural instructions: tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689
Wang Y, Mishra S, Alipoormolabashi P, Kordi Y, Mirzaei A, Naik A, Ashok A, Dhanasekaran AS, Arunkumar A, Stap D et al (2022) Super-natural instructions: generalization via declarative instructions on 1600+ nlp tasks. In: Proceedings of the 2022 conference on empirical methods in natural language processing, pp 5085–5109
Mishra S, Khashabi D, Baral C, Hajishirz, H (2022) Cross-task generalization via natural language crowdsourcing instructions. In: Proceedings of the 60th Annual meeting of the association for computational linguistics (volume 1: long papers), pp 3470–3487
Wang Y, Kordi Y, Mishra S, Liu A, Smith NA, Khashabi D, Hajishirzi H (2022) Self-instruct: aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560
Weidinger L, Mellor J, Rauh M, Griffin C, Uesato J, Huang P-S, Cheng M, Glaese M, Balle B, Kasirzadeh A et al (2021) Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359
Kiegeland S, Kreutzer J (2021) Revisiting the weaknesses of reinforcement learning for neural machine translation. In: Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: human language technologies, pp 1673–1681
Jaques N, Shen JH, Ghandeharioun A, Ferguson C, Lapedriza A, Jones N, Gu S, Picard R (2020) Human-centric dialog training via offline reinforcement learning. In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pp 3985–4003
Rennie SJ, Marcheret E, Mroueh Y, Ross J, Goel V (2017) Self-critical sequence training for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7008–7024
Pang RY, He H (2021) Text generation by learning from demonstrations. In: Proceedings of the international conference on learning representations
Hausknecht M, Ammanabrolu P, Côté M-A, Yuan X (2020) Interactive fiction games: a colossal adventure. Proc AAAI Conf Artif Intell 34:7903–7910
Google Scholar
Snell C, Kostrikov I, Su Y, Yang M, Levine S (2022) Offline rl for natural language generation with implicit language q learning. arXiv preprint arXiv:2206.11871
Lu X, Welleck S, Jiang L, Hessel J, Qin L, West P, Ammanabrolu P, Choi Y (2022) Quark: controllable text generation with reinforced unlearning. arXiv preprint arXiv:2205.13636
Uc-Cetina V, Navarro-Guerrero N, Martin-Gonzalez A, Weber C, Wermter S (2022) Survey on reinforcement learning for language processing. Artif Intell Rev 56:1543–1575
Article Google Scholar
Ramamurthy R, Ammanabrolu P, Brantley K, Hessel J, Sifa R, Bauckhage C, Hajishirzi H, Choi Y (2022) Is reinforcement learning (not) for natural language processing?: benchmarks, baselines, and building blocks for natural language policy optimization. arXiv preprint arXiv:2210.01241
Wu J, Ouyang L, Ziegler DM, Stiennon N, Lowe R, Leike J, Christiano P (2021) Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862
Nakano R, Hilton J, Balaji S, Wu J, Ouyang L, Kim C, Hesse C, Jain S, Kosaraju V, Saunders W et al (2021) Webgpt: browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332
Glaese A, McAleese N, Trębacz M, Aslanides J, Firoiu V, Ewalds T, Rauh M, Weidinger L, Chadwick M, Thacker P et al (2022) Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375
Bai Y, Kadavath S, Kundu S, Askell A, Kernion J, Jones A, Chen A, Goldie A, Mirhoseini A, McKinnon C et al (2022) Constitutional AI: harmlessness from AI feedback. arXiv preprint arXiv:2212.08073
Chung HW, Hou L, Longpre S, Zoph B, Tay Y, Fedus W, Li E, Wang X, Dehghani M, Brahma S et al (2022) Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416
Kojima T, Gu SS, Reid M, Matsuo Y, Iwasawa Y (2022) Large language models are zero-shot reasoners. Adv Neu Inf Process Syst 35:22199–22213
Dosovitskiy A, Springenberg JT, Riedmiller M, Brox T (2014) Discriminative unsupervised feature learning with convolutional neural networks. Adv Neu Inf Process Syst
Dosovitskiy A, Fischer P, Springenberg JT, Riedmiller M, Brox T (2016) Discriminative unsupervised feature learning with exemplar convolutional neural networks. TPAMI. TPAMI-2015-05-0348.R1
Doersch C, Gupta A, Efros AA (2015) Unsupervised visual representation learning by context prediction. In: ICCV
Pathak D, Krahenbuhl P, Donahue J, Darrell T, Efros AA (2016) Context encoders: feature learning by inpainting. In: CVPR
Zhang R, Isola P, Efros AA (2016) Colorful image colorization. In: ECCV
Zhang R, Isola P, Efros AA (2017) Split-brain autoencoders: unsupervised learning by cross-channel prediction. In: CVPR
Noroozi M, Favaro P (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV
Kim D, Cho D, Yoo D, Kweon IS (2018) Learning image representations by completing damaged jigsaw puzzles. In: WACV
Noroozi M, Pirsiavash H, Favaro P (2017) Representation learning by learning to count. In: ICCV
Bojanowski P, Joulin A (2017) Unsupervised learning by predicting noise. In: ICML
Gidaris S, Singh P, Komodakis N (2018) Unsupervised representation learning by predicting image rotations. arXiv
Oord Avd, Li Y, Vinyals O (2018) Representation learning with contrastive predictive coding. arXiv
Henaff O (2020) Data-efficient image recognition with contrastive predictive coding. In: ICML
Donahue J, Simonyan K (2019) Large scale adversarial representation learning. In: NeurIPS, pp 10541–10551
Dumoulin V, Belghazi I, Poole B, Mastropietro O, Lamb A, Arjovsky M, Courville A (2016) Adversarially learned inference. arXiv
Chen M, Radford A, Child R, Wu J, Jun H, Luan D, Sutskever I (2020) Generative pretraining from pixels. In: ICML
Bao H, Dong L, Piao S, Wei F (2021) Beit: Bert pre-training of image transformers. In: International conference on learning representations
Caron M, Touvron H, Misra I, Jégou H, Mairal J, Bojanowski P, Joulin A (2021) Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9650–9660
Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, Chen M, Sutskever I (2021) Zero-shot text-to-image generation. In: International conference on machine learning. PMLR, pp 8821–8831
He K, Chen X, Xi, S, Li Y, Dollár P, Girshick R (2022) Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 16000–16009
Xie Z, Zhang Z, Cao Y, Lin Y, Bao J, Yao Z, Dai Q, Hu H (2022) Simmim: a simple framework for masked image modeling. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9653–9663
Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, Xiao T, Whitehead S, Berg AC, Lo W-Y et al (2023) Segment anything. arXiv preprint arXiv:2304.02643
Wu Z, Xiong Y, Yu SX, Lin D (2018) Unsupervised feature learning via non-parametric instance discrimination. In: CVPR
Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10012–10022
Li X, Wang W, Yang L, Yang J (2022) Uniform masking: enabling mae pre-training for pyramid-based vision transformers with locality. arXiv preprint arXiv:2205.10063
Chen J, Hu M, Li B, Elhoseiny M (2022) Efficient self-supervised vision pretraining with local masked reconstruction. arXiv preprint arXiv:2206.00790
Hinton G, Vinyals O, Dean J (2015) Distilling the knowledge in a neural network. arXiv
Zhuang C, Zhai AL, Yamins D (2019) Local aggregation for unsupervised learning of visual embeddings. In: ICCV
Misra I, Maaten L (2020) Self-supervised learning of pretext-invariant representations. In: CVPR
He K, Fan H, Wu Y, Xie S, Girshick R (2020) Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9729–9738
Chen X, Fan H, Girshick R, He K (2020) Improved baselines with momentum contrastive learning. arXiv
Grill J-B, Strub F, Altché F, Tallec C, Richemond PH, Buchatskaya E, Doersch C, Pires BA, Guo ZD, Azar MG et al (2020) Bootstrap your own latent: a new approach to self-supervised learning. arXiv
Chen T, Kornblith S, Norouzi M, Hinton G (2020) A simple framework for contrastive learning of visual representations. In: ICML
Caron M, Misra I, Mairal J, Goyal P, Bojanowski P, Joulin A (2020) Unsupervised learning of visual features by contrasting cluster assignments. arXiv
Goyal P, Caron M, Lefaudeux B, Xu M, Wang P, Pai V, Singh M, Liptchinsky V, Misra I, Joulin A et al (2021) Self-supervised pretraining of visual features in the wild. arXiv
Radosavovic I, Kosaraju RP, Girshick R, He K, Dollár P (2020) Designing network design spaces. In: CVPR
Chen X, He K (2021) Exploring simple siamese representation learning. In: CVPR
Li J, Zhou P, Xiong C, Hoi SCH (2021) Prototypical contrastive learning of unsupervised representations. In: ICLR. OpenReview.net
Zhang L, Qi G-J, Wang L, Luo J (2019) Aet vs. aed: unsupervised representation learning by auto-encoding transformations rather than data. In: CVPR
Bachman P, Hjelm RD, Buchwalter W (2019) Learning representations by maximizing mutual information across views. arXiv
Yan X, Misra I, Gupta A, Ghadiyaram D, Mahajan D (2020) Clusterfit: improving generalization of visual representations. In: CVPR
Asano YM, Rupprecht C, Vedaldi A (2019) Self-labelling via simultaneous clustering and representation learning. arXiv preprint arXiv:1911.05371
Chen T, Kornblith S, Swersky K, Norouzi M, Hinton G (2020) Big self-supervised models are strong semi-supervised learners. arXiv
Tian Y, Krishnan D, Isola P (2019) Contrastive multiview coding. arXiv
Cubuk ED, Zoph B, Shlens J, Le QV (2019) Randaugment: practical data augmentation with no separate search. arXiv
Tian Y, Sun C, Poole B, Krishnan D, Schmid C, Isola P (2020) What makes for good views for contrastive learning. arXiv
Chen X, Xie S, He K (2021) An empirical study of training self-supervised vision transformers. arXiv
Mitrovic J, McWilliams B, Walker JC, Buesing LH, Blundell C (2021) Representation learning via invariant causal mechanisms. In: ICLR
Tian Y, Chen X, Ganguli S (2021) Understanding self-supervised learning dynamics without contrastive pairs. In: ICML. Proceedings of machine learning research
Xie Z, Lin Y, Yao Z, Zhang Z, Dai Q, Cao Y, Hu H (2021) Self-supervised learning with swin transformers. arXiv
Li Z, Chen Z, Yang F, Li W, Zhu Y, Zhao C, Deng R, Wu L, Zhao R, Tang M et al (2021) Mst: masked self-supervised transformer for visual representation. Adv Neural Inf Process Syst 34
Bao H, Dong L, Piao S, Wei F (2022) BEit: BERT pre-training of image transformers. In: International conference on learning representations. https://openreview.net/forum?id=p-BhZSz59o4
Chen X, Ding M, Wang X, Xin Y, Mo S, Wang Y, Han S, Luo P, Zeng G, Wang J (2022) Context autoencoder for self-supervised representation learning. arXiv preprint arXiv:2202.03026
Dong X, Bao J, Zhang T, Chen D, Zhang W, Yuan L, Chen D, Wen F, Yu N (2021) Peco: perceptual codebook for bert pre-training of vision transformers. arXiv preprint arXiv:2111.12710
You Y, Chen T, Wang Z, Shen Y (2020) When does self-supervision help graph convolutional networks? In: ICML. Proceedings of machine learning research, pp 10871–10880
Jin W, Derr T, Liu H, Wang Y, Wang S, Liu Z, Tang J (2020) Self-supervised learning on graphs: deep insights and new direction. CoRR arXiv:2006.10141
Hu W, Liu B, Gomes J, Zitnik M, Liang P, Pande VS, Leskovec J (2020) Strategies for pre-training graph neural networks. In: ICLR
Perozzi B, Al-Rfou R, Skien S (2014) Deepwalk: online learning of social representations. In: ACM SIGKDD
Grover A, Leskovec J (2016) node2vec: scalable feature learning for networks. In: ACM SIGKDD
Tang J, Qu M, Wang M, Zhang M, Yan J, Mei Q (2015) LINE: large-scale information network embedding. In: WWW
Kipf TN, Welling M (2016) Variational graph auto-encoders. CoRR
Qiu J, Chen Q, Dong Y, Zhang J, Yang H, Ding M, Wang K, Tang J (2020) GCC: graph contrastive coding for graph neural network pre-training. In: KDD
Zhu Y, Xu Y, Yu F, Liu Q, Wu S, Wang L (2021) Graph contrastive learning with adaptive augmentation. In: WWW
You Y, Chen T, Sui Y, Chen T, Wang Z, Shen Y (2020) Graph contrastive learning with augmentations. In: NeurIPS
Mavromatis C, Karypis G (2021) Graph infoclust: maximizing coarse-grain mutual information in graphs. In: PAKDD
Sun Q, Li J, Peng H, Wu J, Ning Y, Yu PS, He L (2021) SUGAR: subgraph neural network with reinforcement pooling and self-supervised mutual information mechanism. In: WWW
Velickovic P, Fedus W, Hamilton WL, Liò P, Bengio Y, Hjelm RD (2019) Deep graph infomax. In: ICLR
Hassani K, Ahmadi AHK (2020) Contrastive multi-view representation learning on graphs. In: ICML. Proceedings of machine learning research, pp 4116–4126
Jiao Y, Xiong Y, Zhang J, Zhang Y, Zhang T, Zhu Y (2020) Sub-graph contrast for scalable self-supervised graph representation learning. In: ICDM, pp 222–231
Sun K, Lin Z, Zhu Z (2020) Multi-stage self-supervised learning for graph convolutional networks on graphs with few labeled nodes. In: AAAI, pp 5892–5899
Rong Y, Bian Y, Xu T, Xie W, Wei Y, Huang W, Huang J (2020) Self-supervised graph transformer on large-scale molecular data. In: NeurIPS
Tan Q, Liu N, Huang X, Chen R, Choi S-H, Hu X (2022) Mgae: Masked autoencoders for self-supervised learning on graphs. arXiv preprint arXiv:2201.02534
Hou Z, Liu X, Dong Y, Wang C, Tang J et al (2022) Graphmae: Self-supervised masked graph autoencoders. arXiv preprint arXiv:2205.10803
Li J, Wu R, Sun W, Chen L, Tian S, Zhu L, Meng C, Zheng Z, Wang W (2022) Maskgae: masked graph modeling meets graph autoencoders. arXiv preprint arXiv:2205.10053
Tian Y, Dong K, Zhang C, Zhang C, Chawla NV (2022) Heterogeneous graph masked autoencoders. arXiv preprint arXiv:2208.09957
Wan S, Pan S, Yang J, Gong C (2021) Contrastive and generative graph convolutional networks for graph-based semi-supervised learning. In: AAAI
Zhang J, Zhang H, Xia C, Sun L (2020) Graph-bert: only attention is needed for learning graph representations. arXiv arXiv:2001.05140
Peng Z, Huang W, Luo M, Zheng Q, Rong Y, Xu T, Huang J (2020) Graph representation learning via graphical mutual information maximization. In: WWW
Hu Z, Dong Y, Wang K, Chang K, Sun Y (2020) GPT-GNN: generative pre-training of graph neural networks. In: KDD
Wang G, Guo H, Li A, Liu X, Yan Q (2023) Federated iot interaction vulnerability analysis. In: 2023 IEEE 39th international conference on data engineering (ICDE). IEEE
Hamilton WL, Ying Z, Leskovec J (2017) Inductive representation learning on large graphs. In: NIPS
Hwang D, Park J, Kwon S, Kim K, Ha J, Kim HJ (2020) Self-supervised auxiliary learning with meta-paths for heterogeneous graphs. In: NeurIPS
Sun F, Hoffmann J, Verma V, Tang J (2020) Infograph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. In: ICLR
Park C, Kim D, Han J, Yu H (2020) Unsupervised attributed multiplex network embedding. In: AAAI, pp 5371–5378
You Y, Chen T, Shen Y, Wang Z (2021) Graph contrastive learning automated. CoRR arXiv:2106.07594
Zeng J, Xie P (2021) Contrastive self-supervised learning for graph classification. In: AAAI, pp 10824–10832
Xu M, Wang H, Ni B, Guo H, Tang J (2021) Self-supervised graph-level representation learning with local and global structure. CoRR arXiv:2106.04113
Wang P, Agarwal K, Ham C, Choudhury S, Reddy CK (2021) Self-supervised learning of contextual embeddings for link prediction in heterogeneous networks. In: WWW
Cao J, Lin X, Guo S, Liu L, Liu T, Wang B (2021) Bipartite graph embedding via mutual information maximization. In: WSDM
Wang X, Liu N, Han H, Shi C (2021) Self-supervised heterogeneous graph neural network with co-contrastive learning. KDD arXiv:2105.09111
Kim D, Oh A (2021) How to find your friendly neighborhood: graph attention design with self-supervision. In: ICLR. https://openreview.net/forum?id=Wi5KUNlqWty
Sun M, Xing J, Wang H, Chen B, Zhou J (2021) Mocl: contrastive learning on molecular graphs with multi-level domain knowledge. CoRR arXiv:2106.04509
Schneider S, Baevski A, Collobert R, Auli M (2019) wav2vec: unsupervised pre-training for speech recognition. In: INTERSPEECH
Baevski A, Schneider S, Auli M (2020) vq-wav2vec: self-supervised learning of discrete speech representations. In: ICLR
Baevski A, Zhou Y, Mohamed A, Auli M (2020) wav2vec 2.0: a framework for self-supervised learning of speech representations. In: NeurIPS
Chung Y, Glass JR (2020) Generative pre-training for speech with autoregressive predictive coding. In: ICASSP
Song X, Wang G, Huang Y, Wu Z, Su D, Meng H (2020) Speech-xlnet: unsupervised acoustic model pretraining for self-attention networks. In: INTERSPEECH
Chung Y, Wang Y, Hsu W, Zhang Y, Skerry-Ryan RJ (2019) Semi-supervised training for improving data efficiency in end-to-end speech synthesis. In: ICASSP
Denisov P, Vu NT (2020) Pretrained semantic speech embeddings for end-to-end spoken language understanding via cross-modal teacher-student learning. In: Interspeech
Chung Y-A, Zhu C, Zeng M (2021) SPLAT: speech-language joint pre-training for spoken language understanding. In: ACL
Zeng M, Tan X, Wang R, Ju Z, Qin T, Liu T-Y (2021) Musicbert: symbolic music understanding with large-scale pre-training. arXiv preprint arXiv:2106.05630
Huang Y-S, Yang Y-H (2020) Pop music transformer: Beat-based modeling and generation of expressive pop piano compositions. In: Proceedings of the 28th ACM international conference on multimedia, pp 1180–1188
Verma P, Berger J (2021) Audio transformers: Transformer architectures for large scale audio understanding. adieu convolutions. arXiv preprint arXiv:2105.00335
Fernando B, Bilen H, Gavves E, Gould S (2017) Self-supervised video representation learning with odd-one-out networks. In: CVPR
Misra I, Zitnick CL, Hebert M (2016) Shuffle and learn: unsupervised learning using temporal order verification. In: ECCV
Kim D, Cho D, Kweon IS (2019) Self-supervised video representation learning with space-time cubic puzzles. In: AAAI
Tao L, Wang X, Yamasaki T (2020) Self-supervised video representation learning using inter-intra contrastive framework. In: ACM multimedia
Lorre G, Rabarisoa J, Orcesi A, Ainouz S, Canu S (2020) Temporal contrastive pretraining for video action recognition. In: WACV
Yao T, Zhang Y, Qiu Z, Pan Y, Mei T (2020) Seco: Exploring sequence supervision for unsupervised representation learning. arXiv
Li LH, Yatskar M, Yin D, Hsieh C, Chang K (2019) Visualbert: a simple and performant baseline for vision and language. CoRR arXiv:1908.03557
Li G, Duan N, Fang Y, Gong M, Jiang D (2020) Unicoder-vl: a universal encoder for vision and language by cross-modal pre-training. In: AAAI
Su W, Zhu X, Cao Y, Li B, Lu L, Wei F, Dai J (2020) VL-BERT: pre-training of generic visual-linguistic representations. In: ICLR
Lu J, Batra D, Parikh D, Lee S (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: NeurIPS
Ramesh A, Dhariwal P, Nichol A, Chu C, Chen M (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125
Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, et al. (2021) Learning transferable visual models from natural language supervision. In: International conference on machine learning. PMLR, pp 8748–8763
Nichol A, Dhariwal P, Ramesh A, Shyam P, Mishkin P, McGrew B, Sutskever I, Chen M (2021) Glide: towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741
Sayed N, Brattoli B, Ommer B (2018) Cross and learn: cross-modal self-supervision. In: GCPR
Ren Z, Lee YJ (2018) Cross-domain self-supervised multi-task feature learning using synthetic imagery. In: CVPR
Tian Y, Krishnan D, Isola P (2020) Contrastive multiview coding. In: ECCV
Zlotchevski A, Drain D, Svyatkovskiy A, Clement CB, Sundaresan N, Tufano M (2022) Exploring and evaluating personalized models for code generation. In: Proceedings of the 30th ACM joint European software engineering conference and symposium on the foundations of software engineering, pp 1500–1508
Thakur, S., Ahmad B, Fan Z, Pearce H, Tan B, Karri R, Dolan-Gavitt B, Garg S (2022) Benchmarking large language models for automated verilog rtl code generation. arXiv preprint arXiv:2212.11140
Nijkamp E, Pang B, Hayashi H, Tu L, Wang H, Zhou Y, Savarese S, Xiong C (2022) Codegen: an open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474
Poesia G, Polozov O, Le V, Tiwari A, Soares G, Meek C, Gulwani S (2022) Synchromesh: reliable code generation from pre-trained language models. arXiv preprint arXiv:2201.11227
Chen Y-C, Li L, Yu L, El Kholy A, Ahmed F, Gan Z, Cheng Y, Liu J (2020) Uniter: universal image-text representation learning. In: European conference on computer vision. Springer, pp 104–120
Zhu X, Zhu J, Li H, Wu X, Li H, Wang X, Dai J (2022) Uni-perceiver: pre-training unified architecture for generic perception for zero-shot and few-shot tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 16804–16815
Reed S, Zolna K, Parisotto E, Colmenarejo SG, Novikov A, Barth-Maron G, Gimenez M, Sulsky Y, Kay J, Springenberg JT et al (2022) A generalist agent. arXiv preprint arXiv:2205.06175
Li W, Gao C, Niu G, Xiao X, Liu H, Liu J, Wu H, Wang H (2020) Unimo: towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409
Stadie BC, Levine S, Abbeel P (2015) Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814
Achiam J, Sastry S (2017) Surprise-based intrinsic motivation for deep reinforcement learning. arXiv preprint arXiv:1703.01732
Pathak D, Agrawal P, Efros AA, Darrell T (2017) Curiosity-driven exploration by self-supervised prediction. In: International conference on machine learning. PMLR, pp 2778–2787
Tang H, Houthooft R, Foote D, Stooke A, Xi Chen O, Duan Y, Schulman J, DeTurck F, Abbeel P (2017) # Exploration: a study of count-based exploration for deep reinforcement learning. Adv Neural Inf Process Syst 30
Dey P, Medya S (2019) Manipulating node similarity measures in network. arXiv
Han B, Zheng C, Chan H, Paster K, Zhang M, Ba J (2021) Learning domain invariant representations in goal-conditioned block mdps. Adv Neural Inf Process Syst 34:764–776
Google Scholar
Ding Y, Florensa C, Abbeel P, Phielipp M (2019) Goal-conditioned imitation learning. Adv Neural Inf Process Syst 32
Shah R, Kumar V (2021) Rrl: Resnet as representation for reinforcement learning. arXiv preprint arXiv:2107.03380
Xiao T, Radosavovic I, Darrell T, Malik J (2022) Masked visual pre-training for motor control. arXiv preprint arXiv:2203.06173
Schwarzer M, Rajkumar N, Noukhovitch M, Anand A, Charlin L, Hjelm RD, Bachman P, Courville AC (2021) Pretraining representations for data-efficient reinforcement learning. Adv Neural Inf Process Syst 34:12686–12699
Google Scholar
Schwarzer M, Anand A, Goel R, Hjelm RD, Courville A, Bachman P (2020) Data-efficient reinforcement learning with self-predictive representations. arXiv preprint arXiv:2007.05929
Ha D, Schmidhuber J (2018) World Models https://doi.org/10.5281/zenodo.1207631arXiv:1803.10122 [cs, stat]
Jaderberg M, Mnih V, Czarnecki WM, Schaul T, Leibo JZ, Silver D, Kavukcuoglu K (2016) Reinforcement learning with unsupervised auxiliary tasks. arXiv. https://doi.org/10.48550/arXiv.1611.05397
Higgins I, Pal A, Rusu AA, Matthey L, Burgess CP, Pritzel A, Botvinick M, Blundell C, Lerchner A (2018) DARLA: improving zero-shot transfer in reinforcement learning. arXiv. https://doi.org/10.48550/arXiv.1707.08475
Finn C, Yu T, Fu J, Abbeel P, Levine S (2016) Generalizing skills with semi-supervised reinforcement learning. arXiv preprint arXiv:1612.00429
Shah R, Kumar V (2021) RRL: Resnet as representation for reinforcement learning. arXiv
Schwarzer M, Anand A, Goel R, Hjelm RD, Courville A, Bachman P (2021) Data-efficient reinforcement learning with self-predictive representations. arXiv. https://doi.org/10.48550/arXiv.2007.05929
Hafner D, Lillicrap T, Ba J, Norouzi M (2019) Dream to control: learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603
Hafner D, Lillicrap T, Norouzi M, Ba J (2020) Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193
Deng F, Jang I, Ah, S (2022) Dreamerpro: reconstruction-free model-based reinforcement learning with prototypical representations. In: International conference on machine learning. PMLR, pp 4956–4975
Wu P, Escontrela A, Hafner D, Goldberg K, Abbeel P (2022) Daydreamer: world models for physical robot learning. arXiv preprint arXiv:2206.14176
Laskin M, Srinivas A, Abbeel P (2020) Curl: contrastive unsupervised representations for reinforcement learning. In: International conference on machine learning. PMLR, pp 5639–5650
Laskin M, Lee K, Stooke A, Pinto L, Abbeel P, Srinivas A (2020) Reinforcement learning with augmented data. Adv Neural Inf Process Syst 33:19884–19895
Google Scholar
Kostrikov I, Yarats D, Fergus R (2020) Image augmentation is all you need: regularizing deep reinforcement learning from pixels. arXiv preprint arXiv:2004.13649
Yarats D, Fergus R, Lazaric A, Pinto L (2021) Mastering visual continuous control: improved data-augmented reinforcement learning. arXiv preprint arXiv:2107.09645
Nair S, Rajeswaran A, Kumar V, Finn C, Gupta A (2022) R3m: a universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601
Parisi S, Rajeswaran A, Purushwalkam S, Gupta A (2022) The unsurprising effectiveness of pre-trained vision models for control. arXiv preprint arXiv:2203.03580
Zhou C, Yan Q, Shi Y, Sun L (2022) $\{$DoubleStar$\}$:$\{$Long-Range$\}$ attack towards depth estimation based obstacle avoidance in autonomous systems. In: 31st USENIX security symposium (USENIX Security 22), pp 1885–1902
Zhou C, Yan Q, Kent D, Wang G, Zhang Z, Radha H (2024) Optical lens attack on deep learning based monocular depth estimation. arXiv preprint arXiv:2409.17376
Zhou C, Yan Q, Liu S (2024) Transient adversarial 3d projection attacks on object detection in autonomous driving. arXiv preprint arXiv:2409.17403
Jin D, Jin Z, Zhou JT, Szolovits P (2020) Is bert really robust? A strong baseline for natural language attack on text classification and entailment. In: AAAI
Zang Y, Qi F, Yang C, Liu Z, Zhang M, Liu Q, Sun M (2020) Word-level textual adversarial attacking as combinatorial optimization. In: ACL
Wallace E, Feng S, Kandpal N, Gardner M, Singh S (2019) Universal adversarial triggers for attacking and analyzing NLP. In: EMNLP-IJCNLP
Kurita K, Michel P, Neubig G (2020) Weight poisoning attacks on pretrained models. In: ACL
Schuster R, Schuster T, Meri Y, Shmatikov V (2020) Humpty dumpty: controlling word meanings via corpus poisoning. In: IEEE symposium on security and privacy
Bao R, Wang J, Zhao H (2021) Defending pre-trained language models from adversarial word substitution without performance sacrifice. In: ACL/IJCNLP
Zhang Z, Li Y, Wang J, Liu B, Li D, Guo Y, Chen X, Liu Y (2022) Remos: reducing defect inheritance in transfer learning via relevant model slicing. In: Proceedings of the 44th international conference on software engineering, pp 1856–1868
Carlini N, Tramer F, Wallace E, Jagielski M, Herbert-Voss A, Lee K, Roberts A, Brown TB, Song D, Erlingsson U et al (2021) Extracting training data from large language models. In: USENIX security symposium, vol 6
Wang G, Zhang L, Yang Z, Li X-Y (2018) Socialite: social activity mining and friend auto-labeling. In: 2018 IEEE 37th international performance computing and communications conference (IPCCC). IEEE, pp 1–8
Han F, Zhang L, You X, Wang G, Li X-Y (2019) Shad: privacy-friendly shared activity detection and data sharing. In: 2019 IEEE 16th international conference on mobile ad hoc and sensor systems (MASS). IEEE, pp 109–117
Chen T, Zhai X, Ritter M, Lucic M, Houlsby N (2019) Self-supervised gans via auxiliary rotation loss. In: CVPR
Abnar S, Dehghani M, Neyshabur B, Sedghi H (2021) Exploring the limits of large scale pre-training. arXiv
Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. In: NAACL-HLT
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: NIPS
Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa PP (2011) Natural language processing (almost) from scratch. J Mach Learn Res 12:2493–2537
Neelakantan A, Shankar J, Passos A, McCallum A (2014) Efficient non-parametric estimation of multiple embeddings per word in vector space. In: EMNLP. https://doi.org/10.3115/v1/d14-1113
Zhou P, Qi Z, Zheng S, Xu J, Bao H, Xu B (2016) Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. In: COLING
Hui DU, Xueke XU, Dayong WU, Liu Y, Zhihua YU, Cheng X (2017) A sentiment classification method based on sentiment-specific word embedding. J Chin Inf Process 31(3):170–176
Liu Y, Ma C, Zhang Y (2017) Hierarchical machine translation model based on deep recursive neural network. Chin J Comput 40(4):861–871
Liang X, Ren F, Liu Y, Pan L, Hou Y, Zhang Y, Yan LI (2018) N-reader: machine reading comprehension model based on double layers of self-attention. J Chin Inf Process
Zhichang Z, Zhenwen Z, Zhiman Z (2019) User intent classification based on indrnn-attention. J Comput Res Dev
Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Adv Neural Inf Proces Syst
Lin M, Chen Q, Yan S (2013) Network in network. arXiv
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: CVPR
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: CVPR
Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: CVPR
Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR
Girshick R (2015) Fast r-cnn. In: ICCV
Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: towards real-time object detection with region proposal networks. arXiv
He K, Gkioxari G, Dollár P, Girshick R (2017) Mask r-cnn. In: ICCV
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: CVPR
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y, Berg AC (2016) Ssd: single shot multibox detector. In: European conference on computer vision
Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: CVPR
Redmon J, Farhadi A (2017) Yolo9000: better, faster, stronger. In: CVPR
Redmon J, Farhadi A (2018) Yolov3: An incremental improvement. arXiv
Bochkovskiy A, Wang C-Y, Liao H-YM (2020) Yolov4: optimal speed and accuracy of object detection. arXiv
Chen Q, Wang Y, Yang T, Zhang X, Cheng J, Sun J (2021) You only look one-level feature. arXiv
Badrinarayanan V, Kendall A, Cipolla R (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell 39(12):2481–2495
Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. In: CVPR
Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2014) Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv
Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2017) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848
Chen L-C, Papandreou G, Schroff F, Adam H (2017) Rethinking atrous convolution for semantic image segmentation. arXiv
Chen L-C, Zhu Y, Papandreou G, Schroff F, Adam H (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In: ECCV
Lin G, Milan A, Shen C, Reid I (2017) Refinenet: multi-path refinement networks for high-resolution semantic segmentation. In: CVPR
Everingham M, Eslami SA, Van Gool L, Williams CK, Winn J, Zisserman A (2015) The pascal visual object classes challenge: a retrospective. Int J Comput Vision 111:98-136
Everingham M, Van Gool L, Williams CK, Winn J, Zisserman A (2010) The pascal visual object classes (voc) challenge. Int J Comput Vision 88:303–338
Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: common objects in context. In: European Conference on Computer Vision
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536
Jordan MI (1997) Serial order: a parallel distributed processing approach. In: Advances in psychology, vol 121, pp 471–495. North-Holland
Elman JL (1990) Finding structure in time. Cogn Sci 14(2):179–211
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput MIT-Press
Graves A, Liwicki M, Fernández S, Bertolami R, Bunke H, Schmidhuber J (2008) A novel connectionist system for unconstrained handwriting recognition. IEEE Trans Pattern Anal Mach Intell 31(5):855–868
Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator. In: CVPR
Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. arXiv
Graves A (2013) Generating sequences with recurrent neural networks. arXiv
Sundermeyer M, Schlüter R, Ney H (2012) Lstm neural networks for language modeling. In: INTERSPEECH
Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial networks. arXiv
Isola P, Zhu J-Y, Zhou T, Efros AA (2017) Image-to-image translation with conditional adversarial networks. In: CVPR
Li C, Wand M (2016) Precomputed real-time texture synthesis with Markovian generative adversarial networks. In: European conference on computer vision
Zhu J-Y, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV
Karras T, Laine S, Aila T (2019) A style-based generator architecture for generative adversarial networks. In: CVPR
Van Oord A, Kalchbrenner N, Kavukcuoglu K (2016) Pixel recurrent neural networks. In: International conference on machine learning
Kim T, Cha M, Kim H, Lee JK, Kim J (2017) Learning to discover cross-domain relations with generative adversarial networks. In: International conference on machine learning
Denton E, Chintala S, Szlam A, Fergus R (2015) Deep generative image models using a laplacian pyramid of adversarial networks. arXiv
Huang X, Li Y, Poursaeed O, Hopcroft J, Belongie S (2017) Stacked generative adversarial networks. In: CVPR
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: CVPR
Woo S, Park J, Lee J-Y, Kweon IS (2018) Cbam: convolutional block attention module. In: ECCV
Cao Y, Xu J, Lin S, Wei F, Hu H (2019) Gcnet: non-local networks meet squeeze-excitation networks and beyond. In: ICCV workshops
Huang Z, Wang X, Huang L, Huang C, Wei Y, Liu W (2019) Ccnet: Criss-cross attention for semantic segmentation. In: ICCV
Zhang H, Goodfellow I, Metaxas D, Odena A (2019) Self-attention generative adversarial networks. In: International conference on machine learning
Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H (2020) Training data-efficient image transformers & distillation through attention. arXiv
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: European conference on computer vision
Graham B, El-Nouby A, Touvron H, Stock P, Joulin A, Jégou H, Douze M (2021) Levit: a vision transformer in convnet’s clothing for faster inference. arXiv
Zhou D, Kang B, Jin X, Yang L, Lian X, Jiang Z, Hou Q, Feng J (2021) Deepvit: towards deeper vision transformer. arXiv
Wang W, Xie E, Li X, Fan D-P, Song K, Liang D, Lu T, Luo P, Shao L (2021) Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. arXiv
Guan T, Wang J, Lan S, Chandra R, Wu Z, Davis L, Manocha D (2021) M3detr: multi-representation, multi-scale, mutual-relation 3d object detection with transformers. arXiv
Valanarasu JMJ, Oza P, Hacihaliloglu I, Patel VM (2021) Medical transformer: gated axial-attention for medical image segmentation. arXiv
Lee RCT, Chin YH, Chang SC (1976) Application of principal component analysis to multikey searching. IEEE Trans Softw Eng 3:185–193
Article Google Scholar
Ye J, Janardan R, Li Q (2004) Two-dimensional linear discriminant analysis. In: Advances in neural information processing systems vol 17 [Neural information processing systems, NIPS 2004, December 13–18, 2004, Vancouver, British Columbia, Canada], pp 1569–1576
Robinson S, Bennett R (1995) A typology of deviant workplace behaviors: a multidimensional scaling study. Acad Manag J 38:555–572
Article Google Scholar
Samko O, Marshall AD, Rosin PL (2006) Selection of the optimal parameter value for the isomap algorithm. Pattern Recogn. Lett. 9:968–979
Article Google Scholar
Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500):2323–2326
Article Google Scholar
Belkin M, Niyogi P (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput 6:1373–1396
Article Google Scholar
Singh AP, Gordon GJ. Relational learning via collective matrix factorization. In: ACM SIGKDD, pp 650–658
Cao S, Lu W, Xu Q (2015) Grarep: learning graph representations with global structural information. In: CIKM, pp 891–900
Ou M, Cui P, Pei J, Zhang Z, Zhu W (2016) Asymmetric transitivity preserving graph embedding. In: ACM SIGKDD, pp 1105–1114
Sugiyama M, Borgwardt KM (2015) Halting in random walk kernels. In: NIPS
Kang U, Tong H, Sun J (2012) Fast random walk graph kernel. In: SIAM, pp 828–838
Shervashidze N, Schweitzer P, Leeuwen EJ, Mehlhorn K, Borgwardt KM (2011) Weisfeiler-lehman graph kernels. J Mach Learn Res 12:2539–2561
MathSciNet Google Scholar
Erhan D, Manzagol P-A, Bengio Y, Bengio S, Vincent P (2009) The difficulty of training deep architectures and the effect of unsupervised pre-training. In: Artificial intelligence and statistics
Erhan D, Courville A, Bengio Y, Vincent P (2010) Why does unsupervised pre-training help deep learning? In: AISTATS
Lee JD, Lei Q, Saunshi N, Zhuo J (2020) Predicting what you already know helps: provable self-supervised learning. arXiv
Tosh C, Krishnamurthy A, Hsu D (2021) Contrastive learning, multi-view redundancy, and linear models. In: Algorithmic learning theory
Arora S, Khandeparkar H, Khodak M, Plevrakis O, Saunshi N (2019) A theoretical analysis of contrastive unsupervised representation learning. arXiv
Anwar S, Tahir M, Li C, Mian A, Khan FS, Muzaffar AW (2020) Image colorization: a survey and dataset. arXiv
Ledig C, Theis L, Huszár F, Caballero J, Cunningham A, Acosta A, Aitken A, Tejani A, Totz J, Wang Z et al (2017) Photo-realistic single image super-resolution using a generative adversarial network. In: CVPR
Perarnau G, Van De Weijer J, Raducanu B, Álvarez JM (2016) Invertible conditional gans for image editing. arXiv
Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. arXiv
Tulyakov S, Liu M-Y, Yang X, Kautz J (2018) Mocogan: decomposing motion and content for video generation. In: CVPR
Wang X, Gupta A (2015) Unsupervised learning of visual representations using videos. In: ICCV
Wei C, Xie L, Ren X, Xia Y, Su C, Liu J, Tian Q, Yuille AL (2019) Iterative reorganization with weak spatial constraints: solving arbitrary jigsaw puzzles for unsupervised representation learning. In: CVPR
Ahsan U, Madhok R, Essa I (2019) Video jigsaw: Unsupervised learning of spatiotemporal context for video action recognition. In: WACV
Pathak D, Girshick R, Dollár P, Darrell T, Hariharan B (2017) Learning features by watching objects move. In: CVPR
Croitoru I, Bogolin S-V, Leordeanu M (2017) Unsupervised learning from video to detect foreground objects in single images. In: ICCV
Sermanet P, Lynch C, Chebotar Y, Hsu J, Jang E, Schaal S, Levine S, Brain G (2018) Time-contrastive networks: self-supervised learning from video. In: ICRA
Korbar B, Tran D, Torresani L (2018) Cooperative learning of audio and video models from self-supervised synchronization. arXiv
Manning CD, Raghavan P, Schütze H (2008). Introduction to information retrieval. https://doi.org/10.1017/CBO9780511809071
Article Google Scholar
Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn. https://doi.org/10.1023/A:1007614523901
Article Google Scholar
Reiter E (2018) A structured review of the validity of bleu. Comput Linguist
Denkowski M, Lavie A (2014) Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the ninth workshop on statistical machine translation
Lin C-Y, Hovy E (2003) Automatic evaluation of summaries using n-gram co-occurrence statistics. In: HLT-NAACL
Kim Y (2014) Convolutional neural networks for sentence classification. In: EMNLP. https://doi.org/10.3115/v1/d14-1181
Kalchbrenner N, Grefenstette E, Blunsom P (2014) A convolutional neural network for modelling sentences. In: ACL
Yang M, Zhao W, Ye J, Lei Z, Zhao Z, Zhang S (2018) Investigating capsule networks with dynamic routing for text classification. In: EMNLP
Yao L, Mao C, Luo Y (2019) Graph convolutional networks for text classification. In: AAAI. https://doi.org/10.1609/aaai.v33i01.33017370
Wang Y, Sun A, Han J, Liu Y, Zhu X (2018) Sentiment analysis by capsules In: WWW. https://doi.org/10.1145/3178876.3186015
Socher R, Perelygin A, Wu J, Chuang J, Manning CD, Ng AY, Potts C (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In: EMNLP
Tai KS, Socher R, Manning CD (2015) Improved semantic representations from tree-structured long short-term memory networks. In: ACL. https://doi.org/10.3115/v1/p15-1150
Zhu X, Sobhani P, Guo H (2015) Long short-term memory over recursive structures. In: ICML
Cheng J, Dong L, Lapata M (2016) Long short-term memory-networks for machine reading. In: EMNLP. https://doi.org/10.18653/v1/d16-1053
Liu P, Qiu X, Chen X, Wu S, Huang X (2015) Multi-timescale long short-term memory neural network for modelling sentences and documents. In: EMNLP. https://doi.org/10.18653/v1/d15-1280
Liu P, Qiu X, Huang X (2016) Recurrent neural network for text classification with multi-task learning. In: IJCAI
Socher R, Pennington J, Huang EH, Ng AY, Manning CD (2011) Semi-supervised recursive autoencoders for predicting sentiment distributions. In: EMNLP
Shen T, Zhou T, Long G, Jiang J, Zhang C (2018) Bi-directional block self-attention for fast and memory-efficient sequence modeling. In: ICLR
Le QV, Mikolov T (2014) Distributed representations of sentences and documents. In: ICML
Iyyer M, Manjunatha V, Boyd-Graber JL, III HD (2015) Deep unordered composition rivals syntactic methods for text classification. In: ACL. https://doi.org/10.3115/v1/p15-1162
Miyato T, Dai AM, Goodfellow IJ (2017) Adversarial training methods for semi-supervised text classification. In: ICLR
Lai S, Xu L, Liu K, Zhao J (2015) Recurrent convolutional neural networks for text classification. In: AAAI
Johnson R, Zhang T (2016) Supervised and semi-supervised text categorization using LSTM for region embeddings. In: ICML
Bao Y, Wu M, Chang S, Barzilay R (2020) Few-shot text classification with distributional signatures. In: ICLR
Wu F, Jr AHS, Zhang T, Fifty C, Yu T, Weinberger KQ (2019) Simplifying graph convolutional networks. In: ICML
Zhang X, Zhao JJ, LeCun Y (2015) Character-level convolutional networks for text classification. In: NIPS
Johnson R, Zhang T (2017) Deep pyramid convolutional neural networks for text categorization. In: ACL. https://doi.org/10.18653/v1/P17-1052
Wang J, Wang Z, Zhang D, Yan J (2017) Combining knowledge with deep convolutional neural networks for short text classification. In: IJCAI. https://doi.org/10.24963/ijcai.2017/406
Huang L, Ma D, Li S, Zhang X, Wang H (2019) Text level graph neural network for text classification. In: EMNLP-IJCNLP. https://doi.org/10.18653/v1/D19-1345
Sun C, Qiu X, Xu Y, Huang X (2019) How to fine-tune BERT for text classification? In: CCL. https://doi.org/10.1007/978-3-030-32381-3_16
Yang Z, Yang D, Dyer C, He X, Smola AJ, Hovy EH (2016) Hierarchical attention networks for document classification. In: NAACL-HLT
Bowman SR. Angeli G, Potts C, Manning CD (2015) A large annotated corpus for learning natural language inference. In: EMNLP.https://doi.org/10.18653/v1/d15-1075
Wang Z, Hamza W, Florian R (2017) Bilateral multi-perspective matching for natural language sentences. In: IJCAI. https://doi.org/10.24963/ijcai.2017/579
Liu X, He P, Chen W, Gao J (2019) Multi-task deep neural networks for natural language understanding. In: ACL. https://doi.org/10.18653/v1/p19-1441
Williams A, Nangia N, Bowman SR (2018) A broad-coverage challenge corpus for sentence understanding through inference. In: NAACL-HLT. https://doi.org/10.18653/v1/n18-1101
Marelli M, Bentivogli L, Baroni M, Bernardi R, Menini S, Zamparelli R (2014) Semeval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In: SemEval@COLING. https://doi.org/10.3115/v1/s14-2001
Dolan B, Quirk C, Brockett C (2004) Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In: COLING
Fu J, Liu P, Neubig G (2020) Interpretable multi-dataset evaluation for named entity recognition. In: EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.489
Lester B, Pressel D, Hemmeter A, Choudhury SR, Bangalore S (2020) Constrained decoding for computationally efficient named entity recognition taggers. In: EMNLP. https://doi.org/10.18653/v1/2020.findings-emnlp.166
Luo Y, Zhao H, Zhan J (2020) Named entity recognition only from word embeddings. In: EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.723
Li X, Feng J, Meng Y, Han Q, Wu F, Li J (2020) A unified MRC framework for named entity recognition. In: ACL. https://doi.org/10.18653/v1/2020.acl-main.519
Zhang Y, Yang J (2018) Chinese NER using lattice LSTM. In: ACL. https://doi.org/10.18653/v1/P18-1144
Meng Y, Wu W, Wang F, Li X, Nie P, Yin F, Li M, Han Q, Sun X, Li J (2019) Glyce: Glyph-vectors for Chinese character representations. In: NeurIPS
Katiyar A, Cardie C (2018) Nested named entity recognition revisited. In: NAACL-HLT. https://doi.org/10.18653/v1/n18-1079
Wang B, Lu W (2018) Neural segmental hypergraphs for overlapping mention recognition. In: EMNLP. https://doi.org/10.18653/v1/d18-1019
Luan Y, Wadden D, He L, Shah A, Ostendorf M, Hajishirzi H (2019) A general framework for information extraction using dynamic span graphs. In: NAACL-HLT. https://doi.org/10.18653/v1/n19-1308
Shibuya T, Hovy EH (2020) Nested named entity recognition via second-best sequence learning and decoding. Trans Assoc Comput Linguist 8:605–620
Lin H, Lu Y, Han X, Sun L (2019) Sequence-to-nuggets: Nested entity mention detection via anchor-region networks. In: ACL. https://doi.org/10.18653/v1/p19-1511
Lai G, Xie Q, Liu H, Yang Y, Hovy EH (2017) RACE: large-scale reading comprehension dataset from examinations. In: EMNLP. https://doi.org/10.18653/v1/d17-1082
Yang Y, Yih W, Meek C (2015) Wikiqa: a challenge dataset for open-domain question answering. In: EMNLP. https://doi.org/10.18653/v1/d15-1237
Santos CN, Tan M, Xiang B, Zhou B (2016) Attentive pooling networks. CoRR arXiv:1602.03609
Lee JY, Dernoncourt F (2016) Sequential short-text classification with recurrent and convolutional neural networks. In: NAACL-HLT. https://doi.org/10.18653/v1/n16-1062
Kim S, D’Haro LF, Banchs RE, Williams JD, Henderson M (2016) The fourth dialog state tracking challenge. In: Dialogues with social robots—enablements, analyses, and evaluation, seventh international workshop on spoken dialogue systems, IWSDS 2016, Saariselkä, Finland, January 13–16, 2016. https://doi.org/10.1007/978-981-10-2585-3_36
Ang J, Liu Y, Shriberg E (2005) Automatic dialog act segmentation and classification in multiparty meetings. In: 2005 IEEE international conference on acoustics, speech, and signal processing, ICASSP ’05, Philadelphia, Pennsylvania, USA, March 18–23, 2005.https://doi.org/10.1109/ICASSP.2005.1415300
Wan Y, Yan W, Gao J, Zhao Z, Wu J, Yu PS (2018) Improved dynamic memory network for dialogue act classification with adversarial training. In: IEEE international conference on Big Data, Big Data 2018, Seattle, WA, USA, December 10–13, 2018. https://doi.org/10.1109/BigData.2018.8622245
Raheja V, Tetreault JR (2019) Dialogue act classification with context-aware self-attention. In: Proc. NAACL, 2019. https://doi.org/10.18653/v1/n19-1373
Xu J, Gan Z, Cheng Y, Liu J (2020) Discourse-aware neural extractive text summarization. In: ACL. https://doi.org/10.18653/v1/2020.acl-main.451
Zou Y, Zhang X, Lu W, Wei F, Zhou M (2020) Pre-training for abstractive document summarization by reinstating source text. In: EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.297
Liu L, Lu Y, Yang M, Qu Q, Zhu J, Li H (2018) Generative adversarial network for abstractive text summarization. In: AAAI
Yang M, Qu Q, Tu W, Shen Y, Zhao Z, Chen X (2019) Exploring human-like reading strategy for abstractive text summarization. In: AAAI. https://doi.org/10.1609/aaai.v33i01.33017362
Bhandari M, Gour PN, Ashfaq A, Liu P, Neubig G (2020) Re-evaluating evaluation in text summarization. In: EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.751
Dong Y, Wang S, Gan Z, Cheng Y, Cheung JCK, Liu J (2020) Multi-fact correction in abstractive text summarization. In: EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.749
Huang D, Cui L, Yang S, Bao G, Wang K, Xie J, Zhang Y (2020) What have we achieved on text summarization? In: EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.33
Kryscinski W, Paulus R, Xiong C, Socher R (2018) Improving abstraction in text summarization. In: EMNLP. https://doi.org/10.18653/v1/d18-1207
Kryscinski W, McCann B, Xiong C, Socher R (2020) Evaluating the factual consistency of abstractive text summarization. In: EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.750
Kouris P, Alexandridis G, Stafylopatis A (2019) Abstractive text summarization based on deep learning and semantic content generalization. In: ACL. https://doi.org/10.18653/v1/p19-1501
Chen K, Wang R, Utiyama M, Sumita E (2020) Content word aware neural machine translation. In: ACL. https://doi.org/10.18653/v1/2020.acl-main.34
Lin Z, Pan X, Wang M, Qiu X, Feng J, Zhou H, Li L (2020) Pre-training multilingual neural machine translation by leveraging alignment information. In: EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.210
Bugliarello E, Okazaki N (2020) Enhancing machine translation with dependency-aware self-attention. In: ACL. https://doi.org/10.18653/v1/2020.acl-main.147
Aji AF, Bogoychev N, Heafield K, Sennrich R (2020) In neural machine translation, what does transfer learning transfer? In: ACL. https://doi.org/10.18653/v1/2020.acl-main.688
Baziotis C, Haddow B, Birch A (2020) Language model prior for low-resource neural machine translation. In: EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.615
Cui Q, Huang S, Li J, Geng X, Zheng Z, Huang G, Chen J (2021) Directqe: Direct pretraining for machine translation quality estimation. In: AAAI
Wu C, Hoi SCH, Socher R, Xiong C (2020) TOD-BERT: pre-trained natural language understanding for task-oriented dialogue. In: EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.66
Campagna G, Foryciarz A, Moradshahi M, Lam MS (2020) Zero-shot transfer learning with synthesized data for multi-domain dialogue state tracking. In: ACL. https://doi.org/10.18653/v1/2020.acl-main.12
Liu Q, Yu L, Rimell L, Blunsom P (2021) Pretraining the noisy channel model for task-oriented dialogue. CoRR arXiv:2103.10518
SST Corpus. http://nlp.stanford.edu/sentiment (2013)
Pang B, Lee L (2005) Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In: ACL
Cer D, Diab M, Agirre E, Lopez-Gazpio I, Specia L (2017) Semeval-2017 task 1: semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv
Hendrickx I, Kim SN, Kozareva Z, Nakov P, Séaghdha DÓ, Padó S, Pennacchiotti M, Romano L, Szpakowicz S (2009) Semeval-2010 task 8: multi-way classification of semantic relations between pairs of nominals. In: Proc. NAACL, 2009
Wiebe J, Wilson T, Cardie C (2005) Annotating expressions of opinions and emotions in language. Lang Resour Eval. https://doi.org/10.1007/s10579-005-7880-9
Article Google Scholar
MPQA Corpus. http://www.cs.pitt.edu/mpqa/ (2005)
Diao Q, Qiu M, Wu C, Smola AJ, Jiang J, Wang C (2014) Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). In: ACM SIGKDD. https://doi.org/10.1145/2623330.2623758
20NG Corpus. http://ana.cachopo.org/datasets-for-single-label-text-categorization (2007)
AG Corpus. http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html (2004)
Reuters Corpus. https://www.cs.umb.edu/~smimarog/textmining/datasets/ (2007)
Reuters Corpus. https://martin-thoma.com/nlp-reuters (2017)
Lehmann J, Isele R, Jakob M, Jentzsch A, Kontokostas D, Mendes PN, Hellmann S, Morsey M, Kleef P, Auer S, Bizer C (2015) Dbpedia–a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web. https://doi.org/10.3233/SW-140134
Article Google Scholar
Ohsumed Corpus (2015) http://davis.wpi.edu/xmdv/datasets/ohsumed.html
Williams A, Nangia N, Bowman SR (2017) A broad-coverage challenge corpus for sentence understanding through inference. arXiv
Rajpurkar P, Zhang J, Lopyrev K, Liang P (2016) Squad: 100,000+ questions for machine comprehension of text. arXiv
Levesque H, Davis E, Morgenstern L (2012) The winograd schema challenge. In: Thirteenth international conference on the principles of knowledge representation and reasoning
Dolan WB, Brockett C (2005) Automatically constructing a corpus of sentential paraphrases. In: IWP
Rajpurkar P, Jia R, Liang P (2018) Know what you don’t know: unanswerable questions for squad. arXiv
Lai G, Xie Q, Liu H, Yang Y, Hovy E (2017) Race: large-scale reading comprehension dataset from examinations. arXiv
Jurafsky D, Shriberg E (1997) Switchboard swbd-damsl shallow-discourse-function annotation coders manual
Li J, Zhou P, Xiong C, Socher R, Hoi SC (2020) Prototypical contrastive learning of unsupervised representations. arXiv preprint arXiv:2005.04966
Donahue J, Simonyan K (2019) Large scale adversarial representation learning. Adv Neural Inf Process Syst 32
He K, Chen X, Xie S, Li Y, Dollár P, Girshick R (2021) Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377
http://yann.lecun.com/exdb/mnist/
http://ufldl.stanford.edu/housenumbers/
https://www.cs.toronto.edu/~kriz/index.html
Coates A, Ng A, Lee H (2011) An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics
https://cs.stanford.edu/~acoates/stl10/
http://www.vision.caltech.edu/Image_Datasets/Caltech101/
Miller GA (1998) WordNet: an electronic lexical database
https://image-net.org/
https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/
Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: a large video database for human motion recognition. In: ICCV
https://www.crcv.ucf.edu/data/UCF101.php
https://www.crcv.ucf.edu/data/UCF50.php
Bossard L, Guillaumin M, Van Gool L (2014) Food-101—mining discriminative components with random forests. In: European conference on computer vision
Berg T, Liu J, Woo Lee S, Alexander ML, Jacobs DW, Belhumeur PN (2014) Birdsnap: large-scale fine-grained visual categorization of birds. In: CVPR
Xiao J, Hays J, Ehinger KA, Oliva A, Torralba A (2010) Sun database: large-scale scene recognition from abbey to zoo. In: 2010 IEEE computer society conference on computer vision and pattern recognition
Xiao J, Ehinger KA, Hays J, Torralba A, Oliva A (2016) Sun database: exploring a large collection of scene categories. Int J Comput Vis 119:3–22
Article MathSciNet Google Scholar
http://places.csail.mit.edu/downloadData.html
http://ai.stanford.edu/~jkrause/cars/car_dataset.html
Maji S, Kannala J, Rahtu E, Blaschko,M. Vedaldi A (2013) Fine-grained visual classification of aircraft. Technical report
https://sites.google.com/site/fgcomp2013/
https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/
https://www.robots.ox.ac.uk/~vgg/data/pets/
https://www.robots.ox.ac.uk/~vgg/data/flowers/
https://www.robots.ox.ac.uk/~vgg/data/dtd/
https://sites.google.com/view/fgvc5/competitions/inaturalist
https://www.inaturalist.org/
Sun C, Shrivastava A, Singh S, Gupta A (2017) Revisiting unreasonable effectiveness of data in deep learning era. In: Proceedings of the IEEE international conference on computer vision, pp 843–852
http://host.robots.ox.ac.uk/pascal/VOC/
http://host.robots.ox.ac.uk/pascal/VOC/voc2007/index.html
http://host.robots.ox.ac.uk/pascal/VOC/voc2011/index.html
http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html
Zhou B, Zhao H, Puig X, Fidler S, Barriuso A, Torralba A (2017) Scene parsing through ade20k dataset. In: CVPR
Zhou B, Zhao H, Puig X, Xiao T, Fidler S, Barriuso A, Torralba A (2019) Semantic understanding of scenes through the ade20k dataset. Int J Comput Vis 127:301–321
Article Google Scholar
https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html
Cordts M, Omran M, Ramos S, Scharwächter T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2015) The cityscapes dataset. In: CVPR workshop on the future of datasets in vision
Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The cityscapes dataset for semantic urban scene understanding. In: Proc. of the IEEE conference on computer vision and pattern recognition (CVPR)
Gupta A, Dollar P, Girshick R (2019) LVIS: A dataset for large vocabulary instance segmentation. In: CVPR
https://davischallenge.org/
https://davischallenge.org/davis2017/code.html
Doersch C. Data analysis project: what makes Paris look like Paris?
http://www.cs.toronto.edu/~nitish/unsupervised_video/
Srivastava N, Mansimov E, Salakhudinov R (2015) Unsupervised learning of video representations using lstms. In: International conference on machine learning
Thomee B, Shamma DA, Friedland G, Elizalde B, Ni K, Poland D, Borth D, Li L-J (2016) Yfcc100m: the new data in multimedia research. Commun ACM
http://projects.dfki.uni-kl.de/yfcc100m/
Jin W, Liu X, Zhao X, Ma Y, Shah N, Tang J (2021) Automated self-supervised learning for graphs. CoRR arXiv:2106.05470
Peng Z, Dong Y, Luo M, Wu X, Zheng Q (2020) Self-supervised graph representation learning via global context prediction. CoRR arXiv:2003.01604
Zhu Y, Xu Y, Yu F, Liu Q, Wu S, Wang L (2020) Deep graph contrastive representation learning. CoRR arXiv:2006.04131
Jin M, Zheng Y, Li Y, Gong C, Zhou C, Pan S (2021) Multi-scale contrastive siamese networks for self-supervised graph representation learning. CoRR arXiv:2105.05682
Hu Z, Fan C, Chen T, Chang K, Sun Y (2019) Pre-training graph neural networks for generic structural feature extraction. CoRR arXiv:1905.13728
Zhu Y, Xu Y, Yu F, Wu S, Wang L (2020) Cagnn: cluster-aware graph neural networks for unsupervised graph representation learning. arXiv preprint arXiv:2009.01674
Zhang H, Lin S, Liu W, Zhou P, Tang J, Liang X, Xing EP (2020) Iterative graph self-distillation. CoRR arXiv:2010.12609
Lin S, Zhou P, Hu Z-Y, Wang S, Zhao R, Zheng Y, Lin L, Xing E, Liang X (2021) Prototypical graph contrastive learning. IEEE trans neural networks learn syst 35(2):2747–2758
Subramonian A (2021) Motif-driven contrastive learning of graph representations. Proc AAAI Conf Artif Intell 35:15980–15981
Google Scholar
Opolka FL, Solomon A, Cangea C, Velickovic P, Liò P, Hjelm RD (2019) Spatio-temporal deep graph infomax. CoRR

Download references

Author information

Ce Zhou, Qian Li, Chen Li, Jun Yu and Yixin Liu contributed equally to this work.

Authors and Affiliations

Michigan State University, East Lansing, MI, USA
Ce Zhou, Guangjing Wang & Qiben Yan
Beihang University, Beijing, China
Qian Li, Chen Li, Cheng Ji, Hao Peng & Jianxin Li
Lehigh University, Bethlehem, PA, USA
Jun Yu, Yixin Liu, Kai Zhang, Lifang He & Lichao Sun
Macquarie University, Sydney, Australia
Jia Wu
Nanyang Technological University, Singapore, Singapore
Ziwei Liu
University of California San Diego, San Diego, CA, USA
Pengtao Xie
Salesforce AI Research, Palo Alto, CA, USA
Caiming Xiong
Duke University, Durham, NC, USA
Jian Pei
University of Illinois at Chicago, Chicago, IL, USA
Philip S. Yu

Authors

Ce Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Qian Li
View author publications
You can also search for this author in PubMed Google Scholar
Chen Li
View author publications
You can also search for this author in PubMed Google Scholar
Jun Yu
View author publications
You can also search for this author in PubMed Google Scholar
Yixin Liu
View author publications
You can also search for this author in PubMed Google Scholar
Guangjing Wang
View author publications
You can also search for this author in PubMed Google Scholar
Kai Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Cheng Ji
View author publications
You can also search for this author in PubMed Google Scholar
Qiben Yan
View author publications
You can also search for this author in PubMed Google Scholar
Lifang He
View author publications
You can also search for this author in PubMed Google Scholar
Hao Peng
View author publications
You can also search for this author in PubMed Google Scholar
Jianxin Li
View author publications
You can also search for this author in PubMed Google Scholar
Jia Wu
View author publications
You can also search for this author in PubMed Google Scholar
Ziwei Liu
View author publications
You can also search for this author in PubMed Google Scholar
Pengtao Xie
View author publications
You can also search for this author in PubMed Google Scholar
Caiming Xiong
View author publications
You can also search for this author in PubMed Google Scholar
Jian Pei
View author publications
You can also search for this author in PubMed Google Scholar
Philip S. Yu
View author publications
You can also search for this author in PubMed Google Scholar
Lichao Sun
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Ce Zhou or Qian Li.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Basic components

1.1 A.1 Basic components on NLP

Table 4 Commonly used notations on NLP and graph

Full size table

1.1.1 A.1.1 Language model

With the rapid development of deep learning, LMs are more and more applicable to the pretraining of NLP models. The LM can estimate the probability of rationality of a paragraph of the text. There are two main types of LMs: statistical LM and neural network LM.

Statistical LM The statistical LM is a mathematical model to solve the context-related characteristics of natural language from the perspective of probability and statistics. The core of statistical LMs is to determine the probability of a sentence appearing in a text. As the theoretical basis of the probabilistic LM, the N-gram model profoundly influences the subsequent LM. It plays a pivotal role in the field of the LM. The N-gram LM introduces the Markov hypothesis, which assumes that the probability of the occurrence of the current word only depends on the nearest $n-1$ words. The maximum likelihood probability of a word $w_{i}$ can be calculated by

$$\begin{aligned} {\begin{matrix} p\left( w_{i} \mid w_{1}, w_{2}, \ldots , w_{N} \right) & = p\left( w_{i} \mid w_{i-n+1}, w_{i-n+2}, \ldots , w_{i-1} \right) \\ & = \frac{C\left( w_{i-n+1}, w_{i-n+2}, \ldots , w_{i}\right) }{\sum _{N} C\left( w_{i-n+1}, w_{i-n+2}, \ldots , w_{i-1} \right) }, \end{matrix}} \end{aligned}$$

(16)

where $T=[w_{1}, w_{2}, \ldots , w_{N}]$ is the text sequence and $C(w_{i-n+1}, w_{i-n+2}, \ldots , w_{i})$ is the co-occurrence frequency of $(w_{i-n+1}, w_{i-n+2}, \ldots , w_{i})$. The $p\left( w_{i} \mid w_{1}, w_{2}, \ldots , w_{N} \right)$ is calculated according to the chain rule

$$\begin{aligned} p\left( w_{1}, w_{2}, \ldots , w_{N}\right) =\prod _{i=1}^{N} p\left( w_{i} \mid w_{1}, w_{2}, \ldots , w_{i-1}\right) . \end{aligned}$$

(17)

N-gram uses the probabilities of each word in the sequence to represent the co-occurrence probability of the whole text sequence. When N is large, it indicates a more vital constraint on the occurrence of the next word in the sequence and leads to more sparse frequency information. When N is small, the statistical results have higher reliability and better generalization ability, but the constraint will be weaker.

Neural LM The statistical LM adopts maximum likelihood estimation, which is intuitive and easy to understand. However, there are still problems such as a lack of long-term dependence, the rapid growth of parameter space and sparse data. Therefore, the neural network is introduced to map the LM to a continuous space. Neural LMs use distributed representations of words to model natural language sequences. Unlike class-based N-gram models, neurolinguistic models are able to recognize two similar words without losing the ability to encode each word as different from the other. It can be directly used for NLP tasks. It mainly introduces Forward Feedback Neural Networks (FFNN), Recurrent Neural Networks (RNN), and pretrained LMs.

As shown in Fig. 15a, FFNN according to the all former words of $x=[w_{1}, \ldots , w_{i-1}]$ calculates the probability of $w_{i}$. In order to predict the conditional probability of $w_{i}$, x is sharing the projection matrix $M \in R^{|V| \times m}$ to a continuous feature vector space according to the projection index, |V| is word library size, m is the dimension of the feature vector. The output is represented as

$$\begin{aligned} y=b_{2}+H^{1,3}_{x}+H^{2,3}_{x} \tanh (b_{1}+H^{1,2}_{x}), \end{aligned}$$

(18)

where $H^{1,2}$, $H^{2,3}$ and $H^{1,3}$ are the weight matrices used to connect each layer, and $b_{1}$ and $b_{2}$ are the bias values of the hidden layer and the output layer respectively.

The structure of the FFNN contains only limited information about the foregoing and has some limitations on the length of the input sequence. Therefore, the RNN LM comes into being. As shown in Fig. 15b, RNN can accept input of any variable length. When the input window is moved, its internal state mechanism can avoid repeated calculation, and parameter sharing further reduces the number of model parameters. Therefore, compared with FFNN, RNN has a great advantage.

The pretraining LM is to get a set of model parameters by pretraining some tasks. It initializes the model with these parameters and then trains to improve the model performance effectively. The commonly used pretraining models are fixed embedding (Word2vec [12], Glove [69], etc), variable embedding (Embeddings from LMs (ELMO) [303], Generative pretrained Transformer (GPT) [50] and Bidirectional Encoder Representations from Transformers (BERT) [13], etc). Here, we give an example of the GPT model, as shown in Fig. 15c. It adopts a two-stage process. In the first stage, the Transformer decoder is used as the basic unit of the model to perform text prediction. In the second stage, the GPT is initialized differently for different downstream tasks, training the model and fine-tuning the parameters.

1.2 A.2 Basic components on GL

Due to the extensive use of graph data in many fields, some communities (e.g., chemistry, protein, and social network) have recently focused on the study of graph pretraining. These pretraining models encode graph attributes, structures, and other information into node representations from multiple perspectives by designing different pretext tasks, which are used to optimize downstream tasks. In this section, we introduce the definition of the basic concepts of graphs, and then provide a formal definition of the PFM on the graph.

1.2.1 A.2.1 Notations and definitions of graphs

Unless particularly specified, the notations used in this article are illustrated in Table 4. We use $\mathcal {G}=\{G_{i}\}^{N}_{i}$ to represent a set of graphs, where N represents the number of graphs. Depending to the graph’s definition of the edges and nodes, graph data can be classified into the following types.

Definition 1

An unattributed graph is $G=(V,E)$, where $v \in V$ is a node, $e \in E$ is an edge, and naturally $E \subseteq V \times V$. Adjacency matrix $A \in \mathbb {R}^{n \times n}$ represents the topology of graph G, where $n=|V|$. $A_{i,j}=1$ denotes there is an edge between node $v_{i}$ and $v_{j}$, otherwise $A_{i,j}=0$.

Definition 2

An attributed graph is $G=(V,E,X_{v},X_{e})$, where $X_{v} \in \mathbb {R}^{n \times d_{v}}$ and $X_{e} \in \mathbb {R}^{m \times d_{e}}$ are the feature matrices of nodes and edges, $|V|=n$, $|E|=m$, $d_{v}$ and $d_{e}$ denotes the feature dimensions of node and edge. In fact, in most application scenarios, only nodes have attributes, and edges have no attributes or only weights.

Definition 3

An undirected graph is $G=(V,E)$, where $e_{i,j} \in E$ means an unordered node pair $(v_{i}, v_{j})$. In particular, the adjacency matrix A of the undirected graph is a symmetric matrix (i.e., $A_{i,j}=A_{j,i}$).

Definition 4

A directed graph is $G=(V,E)$, where $e_{i,j} \in E$ means an ordered node pair $(v_{i}, v_{j})$.

Definition 5

G has a node-type mapping function $f_{v}: V \rightarrow \mathcal {T}^{v}$ and an edge-type mapping function $f_{e}: E \rightarrow \mathcal {T}^{e}$. When $|\mathcal {T}^{v}|=|\mathcal {T}^{e}|=1$, the graph $G=(V,E)$ is a homogeneous graph. In other words, all nodes in G belong to a type, and all edges also belong to one type.

Definition 6

When $|\mathcal {T}^{v}|>1$ and/or $|\mathcal {T}^{e}|>1$, the graph $G=(V,E)$ is a heterogeneous graph. In particular, a heterogeneous graph must be an attributed graph.

1.2.2 A.2.2 Learning settings on graphs

GL methods are usually used to solve machine learning tasks on graph data, and we introduce different settings (supervision mode and learning mode) for GL.

Before that, we first provide the notations of the corresponding mathematical formulation of GL. $C=\{c_{1}, c_{2}, \cdots , c_{K}\}$ is a set of target components defined in a graph set $\mathcal {G}$ ($G^{c_{i}} \in \mathcal {G}$), and $c_{i}$ is associated with a corresponding ground truth label $y_{i} \in \mathcal {Y}=\{1,2,\cdots ,N_{y}\}$, where K denotes the total number of target components, and $N_{y}$ is the number of classes being predicted. Then the graph data can be represented as $D=\{c_{i}, G^{c_{i}}, y_{i}\}^{K}_{i}$, and a complete GL model $M_{GL}$ can also be determined by $y_{i} = M_{GL}(c_{i}, G^{c_{i}})$. For instance, in a node classification task, $c_{i}$ is the node to be classified, $y_{i}$ denotes $c_{i}$’s label in graph $G^{c_{i}}$. Similarly, in a node clustering task, $c_{i}$ is the node to be clustered, $y_{i}$ denotes the corresponding cluster label in graph $G^{c_{i}}$.

Supervision mode

Depending on the source and scale of the training data, the supervision settings of GL can be divided into four types as shown in Fig. 16. Supervised GL is the most common mode in the real scenario. Given the target component $c_{i}$ and the corresponding ground truth label $y_{i}$, the goal is to minimize the loss function between the predicted label of the GL model (i.e., $y^{pred}_{i}=M_{GL}(c_{i}, G^{c_{i}})$) and the expected label $y_{i}$ of all $c_{i}$. Compared with supervised learning, unsupervised GL refers to situations in which no labeled data is provided, only the attributes and structure distribution of graph data (i.e., $(c_{i}, G^{c_{i}})$) can be used. Self-supervised GL is a special case of both supervised and unsupervised learning. Specifically, self-supervised learning mainly uses pretext tasks (e.g., clustering, completion, and partition) to mine its own supervised information (i.e., pseudo-labels) from large-scale unsupervised graph data, and trains the GL model $M_{GL}$ through the self-supervised information, so that it can learn to the valuable features of downstream tasks. In other words, the supervised information of self-supervised learning is not manually labeled, but the pretext tasks automatically construct supervised information from large-scale unsupervised data for supervised learning or training. Semi-supervised learning is a combination of unsupervised and supervised learning, which aims at learning data distribution to predict unlabeled data to solve the problem of difficulty in obtaining labeled data in real scenarios. In GL, semi-supervised learning refers to the realization of pattern recognition given a few labeled data and mass unlabeled data.

Learning mode The GL model $M_{GL}$ is optimized by the given training samples, and adjusted on the validation samples to participate in the test. According to the visibility of the graph data at different stages, the learning settings of GL model $M_{GL}$ can be classified into two categories: inductive learning and transductive learning.

Definition 7

Inductive Learning, which is the most common setting in machine learning tasks, trains the model on labeled data and then tests on samples that have never appeared in the training stage. Formally, given a training sample $\{(c_{i}, G^{c_{i}}, y_{i})\}^{N_{l}}_{i=1}$, $\{(c_{j}, G^{c_{j}})\}^{N_{u}}_{j=1}$, where $N_{l}$ and $N_{u}$ are the numbers of labeled/unlabeled samples. Inductive learning learns a function $f^{ind}: \mathcal {G} \mapsto \mathcal {Y}$ so that $f^{ind}$ is expected to be a good classifier on the future graph data $\{(c_{k}, G^{c_{k}})\}$, beyond $\{(c_{j}, G^{c_{j}})\}^{N_{u}}_{j=1}$.

Definition 8

Transductive Learning is different from inductive learning in that all samples are visible during both the training and testing stages. Formally, given a training sample $\{(c_{i}, G^{c_{i}}, y_{i})\}^{N_{l}}_{i=1}$, $\{(c_{j}, G^{c_{j}})\}^{N_{u}}_{j=1}$, transductive learning learns a function $f^{trans}: \mathcal {G}^{l+u} \mapsto \mathcal {Y}^{l+u}$ so that $f^{trans}$ is expected to be a good classifier on the unlabeled data $\{(c_{j}, G^{c_{j}})\}^{N_{u}}_{j=1}$.

Under the supervised setting (including semi-/self-supervised), the unified classifier optimization methods of inductive learning and transductive learning can be written as:

$$\begin{aligned} \mathcal {L}= \frac{1}{K} \sum ^{K}_{i=1} \mathcal {L}(f^{(\cdot )}_\theta (c_{i}, G^{c_i}), y_i), \end{aligned}$$

(19)

where $\mathcal {L}$ is the cross-entropy loss, $c_{i}$ can be node, edge or subgraph of its associated graph $G^{c_{i}}$, and $f^{(\cdot )}_\theta$ denotes inductive/transductive function with parameter $\theta$.

Compared with using only one pretext task, some methods have designed some integration mechanisms to incorporate the advantages of multiple pretext tasks into a unified framework.

B Traditional learning methods

1.1 B.1 Traditional text learning

NLP is a research field that integrates linguistics and computer science. Its main research tasks include part-of-speech tagging, named entity recognition, semantic role labeling, machine translation, question answering, sentiment analysis, text summarization, text classification, relationship extraction, event extraction, etc. The LM can be considered the cornerstone of the downstream NLP tasks. It experiences four processes: grammar rule LM, probabilistic LM, neural network LM, and pretraining LM. A PFM trains on a large benchmark dataset to obtain a model that can solve new similar tasks, which has become a new hotspot in current LM research.

Word representations play a significant role in downstream tasks, which is the basis of NLP. The N-gram model preprocesses text features and encodes adjacent N words as a group, which makes it overly dependent on the richness of the training corpus. Otherwise, data-sparse is likely to occur, and the computational complexity will increase exponentially with the increase of N. Neural Network LM (NNLM) [11] adopts the idea of word vector for the first time, and the low-dimensional word vector of distributed representation can solve the discrete problem caused by word embedding well. However, it is still challenging to solve the problem of high computational complexity. The computational complexity of the word2vec model is independent of the selected window size but is determined by the dictionary size and the word vector dimension. Many downstream tasks can be significantly improved by training on a large corpus using word vector embedding after initial training. However, the problem of polysemy for the static word vector is still unsolved, and it still belongs to the shallow LM [304, 305]. Therefore, more effective models are urgently needed to deal with the data set more flexibly. To capture high-level concepts of context, such as polysemy elimination, syntactic structure, etc. Neelakantan et al. [306] propose to learn multiple embeddings per word type. Zhou et al. [307] integrate the features on both dimensions of the matrix to enrich semantics by using subword information. Based on the Continuous Bag Of Words (CBOW) [12] in word2vec, Hui et al. [308] fine-tune the generated word vectors for emotion and obtain the word vectors containing both semantic meaning and emotional tendency, which significantly improved the performance in the Weibo sentiment classification task. Liu et al. [309] propose a model of hierarchical translation for machine translation. It uses the neural LM based on RNN as the word vector generation model. Liang et al. [310] propose an approach based on the double-layer self-attention mechanism for machine reading comprehension, and the model is divided into three parts: single document encoder, multi-document encoder, and answer prediction. In the single document encoder, the problem of the context information is represented by the Gated Recurrent Unit (GRU) model. Zhang et al. [311] propose an INDependent RNN (INDRNN) and attention mechanism for user intention classification, using word vectors generated by word2vec as input. The model introduces a word-level attention mechanism to effectively quantify the contribution of domain vocabulary to the intention category.

1.2 B.2 Traditional image learning

There are several types of neural networks in the deep learning era, from the beginning of most famous convolutional neural networks (CNNs) to the subsequent Attention- and Transformer-based networks. A deep neural network refers to an artificial neural network with more hidden layers, and more parameters are used to represent the target model, which leads to the SOTA performance on the benchmark dataset from image to video. Here, we introduce the milestone networks in CV chronologically.

1.2.1 B.2.1 Convolution-based networks

ImageNet [312], as one of the most important databases in computer vision, has aroused many milestone network architectures in image classification, including AlexNet [313], NIN [314], VGG [315], GoogLeNet [316], ResNet [317], DenseNet [318], etc. When it comes to object detection and semantic segmentation, researchers explore R-CNNs [319,320,321,322], FCN [323], SSD [324], YOLOs [325,326,327,328,329], SegNet [330], PSPNet [331], Deeplabs [332,333,334,335], RefineNet [336], etc. on common benchmark datasets, such as PASCAL VOC [337, 338], MS COCO [339], etc.

There are several shared features among these popular convolution-based networks: (1) data augmentation. Deep models require much more data to fit a complicated model, thus the data augmentation technique such as flipping, rotation, cropping, scaling, translation, and even adding noises enlarges the training dataset; (2) convolution. The convolutional kernel is used to extract the features of original image data, which maintains the spatial structure for the adjacent pixels; (3) deep architecture. The deep architecture contains more parameters, which enhance the capability of the model. These common features have contributed to the SOTA performance of convolutional neural networks (CNNs) in computer vision for nearly recent 10 years.

1.2.2 B.2.2 Recurrent neural networks

Different from CNNs targeting 2D-dimensional image applications, recurrent neural networks (RNNs) [340,341,342] try to use recursive cells to process pictures in sequence, i.e., video data. However, the weaknesses of gradient explosion and long-term dependencies restrict further development of this model. To handle these problems embedded inside the RNN-based models, long short-term memory (LSTM) [343] was proposed by Hochreiter and Schmidhuber in 1997. In addition, the improved capability of LSTMs produces popularity and attracts attention both in NLP and CV [344,345,346,347,348].

1.2.3 B.2.3 Generation-based networks

Generative Adversarial Networks (GANs) [349] have provided a paradigm to learn representations for unlabelled data, and spawn many GAN-based approaches on downstream tasks. In image translation, pix2pix software [350] first proposes the conditional adversarial networks as a solution to the image-to-image translation problems, and achieves reasonable results on real-world datasets. Markovian Generative Adversarial Networks (MGANs) [351] is a method to generate texture synthesis, which can be applied to style transfer and video stylization. CycleGAN [352] provides a learning algorithm to translate an original image from the source domain to a target domain without containing pairs of images in datasets for supervised learning. StyleGAN [353] is a style-based generator to serve as an alternative architecture for traditional GANs. Pixel Recurrent Neural Networks (PixelRNN) [354] aims to complete images by modeling full dependencies between the color channels. DiscoGAN [355] is designed to learn relations between different domains.

GANs have also provided a novel direction to study data synthesis because it perfectly simulates the distribution of the original data. Laplacian Pyramid of Adversarial Networks (LAPGAN) [356] uses a cascade of convolutional networks to generate images in a coarse-to-fine fashion. Similarly, Stacked Generative Adversarial Networks (SGAN) [357] decompose variations into multiple levels and gradually resolve uncertainties by stacking several GANs in a top-down way.

1.2.4 B.2.4 Attention-based networks

Based on the success of CNNs in the area of CV, the attention module is designed to be equipped with the popular CNNs. For example, SENet [358] proposes a channel attention module, which won first place in the competition of ILSVRC2017. In addition, CBAM [359] sequentially infers attention maps along both channel and spatial dimensions. Many innovative works, such as GCNet [360] and CCNet [361], are inspired by this idea of soft-attention mechanism, which outperforms the traditional CNNs on major benchmarks for both recognition and segmentation tasks. In particular, the self-attention mechanism [362], calculating the response at a position among all entities in a sequence by attending to all positions within the same sequence, is proposed to estimate the relevance of one position to other positions in feature maps. To control the expected entities and model more complex relations among different elements in the sequence, masked self-attention and multi-head attention [38] are the key components proposed to substitute the function of convolutions in the era of transformers.

1.2.5 B.2.5 Transformer-based networks

Recently, inspired by the self-attention mechanism and subsequent success of the transformer in NLP, researchers in CV also have tried to use the transformer as an alternative to the convolution. Self-attention-based transformer models always operate in a two-stage training mechanism: (1) pretraining on a primitive dataset (always big but not well labeled) by defining pretext tasks; (2) transferring the pretrained weights to the downstream tasks and adjusting the parameters on the target domain dataset by finetuning. Vision Transformer (ViT) [40] is applied on CV and achieves the SOTA performance on major benchmark datasets. Data-efficient image Transformers (DeiT) [363]was proposed by Facebook AI to train image transformers more efficiently and maintain the SOTA performance simultaneously. DEtection TRansformer (DETR) [364] significantly outperforms competitive baselines in both object detection and semantic segmentation. LeViT [365] outperforms existing benchmarks with respect to balancing the accuracy and training speed. Image GPT [149] is inspired by a sequence transformer in NLP, which can compete with several self-supervised benchmarks on ImageNet. On the basis of this research, DeepViT [366] explores a deeper architecture to improve performance consistently by making the transformer go deeper. Moreover, many researchers try to apply the transformer to more specific tasks. Pyramid Vision Transformer (PVT) [367] introduces the pyramid structure to overcome the difficulties of porting the transformer to various dense prediction tasks, and achieves the SOTA performance on major benchmark datasets. M3DeTR [368] is a novel research on multi-representation, multi-scale, and mutual-relation 3D object detection with transformers. Medical Transformer (MedT) [369] has focused on medical image segmentation and outperforms previous CNN-based and transformer-based architecture. In conclusion, the transformer has become a novel and popular research area in CV and its performance is proved by many existing works.

1.3 B.3 Traditional graph learning

GL aims to embed the graph as a low-dimensional representation while preserving the desired properties of the original graph data. Classical GL methods are usually implemented using statistical methods or artificially designed components.

Dimension reduction As a commonly used method in feature engineering, dimension reduction aims to reduce the dimension of high-dimensional attribute graph data into a lower-dimensional representation. In GL, it highlights the remaining information at the cost of losing part of the attributes. According to different dimensionality reduction strategies, such methods can be classified into two types. The first type is subspace learning under the linear assumption. Based on the assumption that the principal components [370] related to the larger variance represent important structural information, and those smaller variances represent noise, principal component analysis calculates a low-dimensional representation that maximizes the variance of the data. Linear Discriminant Analysis (LDA) [371] achieves dimension reduction by maximizing the ratio of inter-class scattering and intra-class scattering to obtain a linear projection matrix. Multi-Dimensional Scaling (MDS) [372] is a distance-maintaining manifold learning method. It produces a mapping in a lower dimension to preserve dissimilarities between nodes as much as possible. The second type is nonlinear dimension reduction, which aims to automatically learn nonlinear topology to achieve manifold learning. Isomap [373] first constructs a neighborhood graph on the manifold and calculates the shortest path between pairs of nodes, and then uses MDS to construct a low-dimensional embedding. Locally Linear Embedding (LLE) [374] first allocates neighbors for each node. Then, it calculates the weighted $W_{i,j}$, the best linear reconstruction feature $X_{i}$ from its neighbors. Finally, calculate the low-dimensional embedding for the optimal reconstruction of $W_{i,j}$.

Matrix factorization Greatly influenced by the idea of dimension reduction, the models based on matrix factorization emerged in the early research of GL. Such models aim to reconstruct the adjacency matrix of the graph to achieve dimension reduction while maintaining structural information. Although these models have significant limitations, in fact, their ideas still inspire many current studies. Depending on how the matrix is constructed, such methods often append specific constraints. Graph Laplacian eigenmaps [375] minimizes a loss function to ensure that nodes close to each other on the manifold are mapped into the low-dimensional space and still maintain the local distances. Node proximity matrix factorization [376] minimizes the objective function $|W-YY^{cT}|$ through matrix factorization to approximate the proximity of nodes in the low-dimensional space, where Y and $Y^{c}$ are the embeddings for nodes and context nodes, and W is the default node proximity matrix. GraRep [377] aims to preserve the high-order proximity of graphs in the embedding space, thus it derives a k-th order transition matrix, $A^{k}$, by multiplying the adjacency matrix to itself k times. The transition probability from node $v_{i}$ to node $v_{j}$ is the entry in the i-th row and j-th column of the k-th order transition matrix, i.e., $p_{k}(v_{i}|v_{j})=A^{k}_{i,j}$. Then GraRep defines the loss function using the skip-gram model and negative sampling. To capture the high-order proximity between node pairs, HOPE [378] preserves asymmetric transitivity in approximating the high-order proximity. Specifically, the goal of HOPE is to minimize the objective function $||S-WC^{T}||^{2}_{F}$, where the elements $s_{i,j} \in S$ represent a certain edge feature (e.g., Katz index, the Rooted Page-Rank, the Common Neighbors, and the Adamic-Adar) between the corresponding node pairs $(v_{i}, v_{j})$, W is the node representation matrix, and C is the embedding of the node as the context. To reconstruct the matrix S more simply and elegantly, HOPE proposes to obtain W and C directly based on the low-rank singular value decomposition (SVD).

Graph kernel The kernel method is an important algorithm in pattern recognition and machine learning. Its basic idea is to give the graph embedding $x \in X$ in the original low-dimensional space X, and maps the embeddings to a high-dimensional feature space H through a nonlinear function $f^{ker}$. Then the nonlinear problem in X can be solved by constructing a linear algorithm in H. There are two main types of kernel methods on graph data. The first type uses the embedding method to convert the graph data into vectorial representation, and then directly implements the application based on the kernel function. However, due to the loss of mass graph structure information when transforming graphs into vectorial representation, such methods do not perform well in real scenarios. The second type of method introduces the graph kernel function to solve this problem. Based on retaining the advantages of the original kernel function, it directly represents the structural information of the graph data in the high-dimensional Hilbert space. The definition of the traditional method of graph kernel comes from R-convolution. According to the difference between the contrast substructure and the decomposition method of the graph structure, a large number of methods based on graph kernel have been proposed. For example, the work of [379, 380] proposed a random-walk kernel based on calculating the number of common synchronization between two graph structures, To reduce the computational complexity and optimize the random walk strategy, a graph kernel based on comparing the shortest path information between two graph structures is proposed. To capture more complex topological information, the Weisfeiler-Lehman subtree graph kernel is proposed, which is based on a one-dimensional Weisfeiler-Lehman isomorphism test algorithm to find isomorphic subtree structures in a bunch of graph structures [381].

C PFMs theory

Since pretraining has received great attention from the research community, the investigation in the theory-backed explanation is similarly eye-catching. During the unsupervised pretraining era before SSL, Erhan et al. [382, 383] shed some light on the theoretical explanation for the confirmation and clarity of learning difficulties. [382] researches the influence of pretraining with respect to architecture depth, model capacity, and the number of training samples, and demonstrates the robustness of pretraining from the perspective of both the optimization and the regularization. [383] further prove the regularizer role of the unsupervised pretraining in the downstream supervised tasks.

1.1 C.1 Different perspectives

Pretext tasks. [384] posits a mechanism based on approximate conditional independence (CI) to connect pretext and downstream task data distributions, which suggests that pretext tasks can self-supervisedly learn the representations from unlabelled data that reduce the sample complexity of downstream supervised tasks. The experiments both on CV and NLP tasks support this theory. Representation Learning via Invariant Causal Mechanisms (ReLIC) [181] also provides a theoretical understanding from the perspective that the explicit invariance constraints across augmentations can yield improved generalization guarantees.

Multi-view redundancy. From the perspective of a multi-view setting, [385] understands contrastive learning as exploiting multiple views of data for representation learning. This theory provides a theoretical analysis that the linear functions of these representations from pretraining are still competitive compared with the non-linear optimal predictor of the label. In other words, the linear functions of the learned representations are nearly optimal on downstream prediction tasks whenever the different views provide redundant information about the label.

1.2 C.2 Different categories

Contrastive learning. Although experimental results show us that previous designs such as contrastive loss or momentum updating can produce impressive performance in SSL. However, one of the most important questions that remain in SSL is why these methods can maintain representation consistency during the pretraining process. A naive view is the minimization between positive pairs can boost invariance learning, while the maximization between negative pairs contributes to avoiding representational collapse. [386] shows that contrastive learning can achieve competitive bound via intra-class concentration, thus leading to the reduction of sample complexity on downstream tasks from the benefit of transferred representations. This research also provides a framework that can be utilized both on the guarantees of the quality of learning representations during the pretraining phase and the future assumptions added to the framework that allow tighter guarantees.

Non-contrastive learning. While contrastive learning shows an effect by capturing the similarity and dissimilarity among the unlabelled examples, and further converging to an average local optimum which represents the general representations, recent non-contrastive SSL methods such as BYOL and SimSiam also shows the SOTA performance without the design of comparison between negative pairs. Based on the analysis of the eigenspaces, Tian et al. [182] study the behavior of non-contrastive SSL training and prove that the effects are from both the predictor and stop-gradient signal. Based on this theory, a novel and simple DirectPred method is proposed as a by-product of this theoretical exploration.

D Pretext task taxonomy on CV

Pretext tasks are always designed to use pseudo labels generated from the data itself to pretrain the proxy model. There are five categories of pretext tasks for self-supervised: (1) generation-based methods; (2) transformation-based methods; (3) context-based methods; (4) semantic-based methods; (5) view-based methods.

Generation-based methods This type of method is GAN-based in the deep learning era. For image generation, there are several applications including image colorization [138, 387], image super-resolution [388], image editing [389], context encoders [137], image-to-image translation [352], etc. On the other hand, video generation tasks contains future prediction [145], video action recoginition [241], video generation [390, 391], and video representaion [392].

Transformation-based methods Transformation is a typical technology that serves as a data augmentation method to enlarge the training dataset in traditional DL. However, if transformations of the same image are labeled as positive samples and others as negative samples, this pretext task can be used for self-supervised pretraining [166]. Popular transformation in self-supervised learning (SSL) contains color transformation (such as Jitter, Gaussian blur, and adjusting brightness) and geometric transformation (such as flipping, cropping, scaling, and rotation).

Context-based methods Basically, the design and construction of many artificial tasks, such as solving Jigsaw puzzles [140], comparing context similarity, and discriminating sequence order. Solving Jigsaw puzzles is defined as identifying the correct position of patches from an image. This task can help the model to learn an encoder for transfer learning [141, 393], and the feature representations are effective after the pretrained dataset is big enough. In addition, the design of video Jigsaw is also proposed for unsupervised learning [394]. Differently, context similarity tries to label the patches from the same images as positive samples and others as negative samples, then use a predefined similarity function to scale the distance between different pairs [49].

Semantic-based methods Semantic-based methods contain object detection, semantic segmentation, and depth prediction. These tasks also involve pretext tasks because their pixel-based labels can learn a more robust feature representation than simpler tasks. These pre-text tasks always establish on video dataset [395, 396].

View-based methods This type of method contains both single-modal data and multi-modal data. For the single-modal data, the original data is treated as the anchor and different viewpoints generate its positive pair samples. Sometimes the time slices in sequence-based data are treated as negative pairs because the scene is changed as time goes [397]. In addition, multi-modal data is usual in view-based methods, which are also called cross-modal-based methods here. Such as audio-video cooperative learning [398], RGB and optical flow cross-modal distance training [250].

E Evaluation metrics

Classification task The classification task, according to a labeled training document, determines the relationship between document features and document categories. The learned relationship model is then used to determine the category of new documents.

Accuracy and error rate The key metrics for a text classification model are Accuracy and Error Rate. The terms Accuracy and Error Rate are defined as follows:

$$\begin{aligned} Accuracy =\frac{(\textrm{TP}+\textrm{TN})}{N}, \end{aligned}$$

(20)

$$\begin{aligned} Error Rate = 1 - Accuracy =\frac{(\textrm{FP}+\textrm{FN})}{N}, \end{aligned}$$

(21)

where $\textrm{TP}$ and $\textrm{FP}$ denote true positive and false positive, $\textrm{TN}$ and $\textrm{FN}$ stand for true negative and false negative.

Precision, Recall and F1 Regardless of the standard type and error rate, there are very important metrics used for unbalanced testing sets. These metrics are similar to the concept of the class label in the testing samples. F1 is defined as the harmonic average of Precision and Recall. Thus, Accuracy, Recall, and F1 can be represented as:

$$\begin{aligned} Precision =\frac{\textrm{TP}}{\textrm{TP}+\textrm{FP}}, \quad Recall =\frac{\textrm{TP}}{\textrm{TP}+\textrm{FN}}, \end{aligned}$$

(22)

$$\begin{aligned} F1 =\frac{2 \mathrm { Precision \times Recall }}{\textrm{ Precision }+\textrm{Recall}}. \end{aligned}$$

(23)

When the accuracy, F1, and recall values hit 1, the desired results are obtained. On the other hand, when the values turn 0, we get the worst consequence. For the multi-class classification task, the precision and recall values of each class can be determined independently, and then the individual and overall performance can be analyzed.

Micro-F1 The Micro-F1 [399] is a metric that measures all labels’ overall accuracy and recall. We denote $Micro\text {-}F1$ as:

$$\begin{aligned} Micro\text {-}F1=\frac{2 \textrm{P}_{t} \times R_{t}}{\textrm{P}+\textrm{R}}, \end{aligned}$$

(24)

$$\begin{aligned} P=\frac{\sum _{t \in \mathcal {S}} T P_{t}}{\sum _{t \in S} T P_{t}+F P_{t}},\quad R=\frac{\sum _{t \in S} T P_{t}}{\sum _{t \in \mathcal {S}} T P_{t}+F N_{t}}. \end{aligned}$$

(25)

where $T P_{t}$ and $F P_{t}$ mean true and false positive of the t th label on a text.

$Macro\text {-}F1$ The $Macro\text {-}F1$ calculates the average F1 of all labels by giving equal weight to them. $Macro\text {-}F1$ is denoted as:

$$\begin{aligned} {Macro}\text {-}F1=\frac{1}{\mathcal {S}} \sum _{t \in \mathcal {S}} \frac{2 \textrm{P}_{t} \times R_{t}}{\mathrm {P_{t}}+\mathrm {R_{t}}}, \end{aligned}$$

(26)

$$\begin{aligned} P_{t}=\frac{T P_{t}}{T P_{t}+F P_{t}},\quad R_{t}=\frac{T P_{t}}{T P_{t}+F N_{t}}. \end{aligned}$$

(27)

where $T N_{t}$ and $F N_{t}$ represent true and false negative of the t th label. $\mathcal {S}$ stands for the label set of all samples.

Mean reciprocal rank (MRR) The MRR is commonly used to evaluate the performance of ranking algorithms on Question Answering (QA) and Information Retrieval (IR) tasks. MRR is represented as

$$\begin{aligned} \textrm{MRR}=\frac{1}{Q} \sum _{i=1}^{Q} \frac{1}{{rank}_{i}}, \end{aligned}$$

(28)

where ${rank}_{i}$ is the ranking of the i th ground-truth answer. The number of predicted labels on each text is denoted by Q. Moreover, there are some metrics, such as EM, Hamming-loss [400], P@K and NDCG@K.

Generation task Generation task uses LMs to predict the next most likely word or sentence based on input data.

Bilingual evaluation understudy (BELU) BLEU compares the generated sentences to the reference sentence and makes predictions using automatic machine translation algorithms. The language creation problem is also supported by deep learning technologies such as speech recognition, image caption generation, and text summarization. They can’t discover anything better, but it has a few advantages: it’s simple to comprehend, correlates well with human judgment, and is language-independent. As a bilingual evaluation aid, BLEU is mainly used to evaluate the quality of machine translation [401]. BLEU compares the degree of overlap between the N-gram in the candidate text and the N-gram in the reference text. The higher overlap indicates better translation quality. The formula for the computation is:

$$\begin{aligned} B L E U=BP \times {\text {exp}}\left( \sum _{n=1}^{N} W_{n} \log P_{n}\right) , \end{aligned}$$

(29)

where N represents N-gram, BP is penalty factor, $P_{N}$ is multivariate precision, and $W_{N}=1/N$ is the corresponding weight of multivariate precision. r represents the length of the shortest reference translation, and c represents the length of the candidate translation, then the specific calculation method of penalty factor BP is as follows:

$$\begin{aligned} BP= {\left\{ \begin{array}{ll}1, & l_{t}>l_{a} \\ e^{1-l_{a} / l_{t}}, & l_{t} \le l_{a}\end{array}\right. }, \end{aligned}$$

(30)

where $l_{t}$ is the number of words in machine translation and $l_{a}$ is the number of words in reference answer. The penalty factor is mostly used to penalize large gaps between machine and reference translations.

ROUGE (recall-oriented understudy for gisting evaluation) ROUGE stands for N-gram co-occurrence statistics, which are used in automatic evaluation methods. It is expanded on the similarity of N-grams, which means that an N-gram is a subsequence of the main document text in terms of N words. There are four types of ROUGE, including ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S. The first two are commonly used, and the N in rouge-N refers to N-gram, which is calculated similarly to BLEU, except BLEU is based on accuracy, while ROUGE is based on recall. L in ROUGE-L refers to the Longest Common Subsequence, which is calculated as the Longest Common Subsequence between the candidate abstract and the reference abstract. Thus, the longer the length, the higher the score, based on the F value. The calculation formula of ROUGE-N and ROUGE-L is mainly introduced. The calculation formula of ROUGE-N is as follows:

$$\begin{aligned} ROUGE-N=\frac{\sum _{S \in \{\text{ ReferenceSummaries }\}} \sum _{\text{ gram}_{n} \in S} \text{ Count}_{\text{ match }}\left( {\text {gram}}_{n}\right) }{\sum _{S \in \{\text{ ReferenceSummaries }\}} \sum _{\text{ gram}_{n} \in S} \text{ Count }\left( \text{ gram}_{n}\right) }, \end{aligned}$$

(31)

where N stands for N-gram, $Count(gram_{n})$ represents the frequency of occurrence of an N-gram, and $Count_{match}(gram_{n})$ represents the frequency of co-occurrence of an N-gram. The calculation formula of ROUGE-L is as follows:

$$\begin{aligned} ROUGE-L=F_{lcs}=\frac{\left( 1+\beta ^{2}\right) R_{\textrm{lcs}} P_{\textrm{lcs}}}{R_{1 \textrm{cs}}+\beta ^{2} P_{\textrm{lcs}}}, \end{aligned}$$

(32)

$$\begin{aligned} R_{\textrm{lcs}}=\frac{L C S(X, Y)}{M}, \end{aligned}$$

(33)

$$\begin{aligned} P_{\text{ lcs } }=\frac{L C S(X, Y)}{N}, \end{aligned}$$

(34)

where X is the candidate abstract, Y represents the reference abstract, LCS(X, Y) table indicates the length of the Longest Common Subsequence (LCS) of the candidate abstract and references abstract, M stands for the length of reference abstract, and N denotes the length of the candidate abstract. The ROUGE method is characterized by N-gram co-occurrence statistics, based on recall rate (ROUGE-N) and F-value (ROUGE-L). They are often used in text summaries. It is worth noting that ROUGE is word-based correspondence rather than semantic-based correspondence, but this can be mitigated by increasing the number of reference summaries.

METEOR METEOR, also known as an explicitly sorted translation evaluation metric [402], is an improved version of the BLEU standard that aims to address some flaws in the BLEU standard. Using WordNet to calculate matching relationships between specific sequences, synonyms, roots, affixes, and definitions improves BLEU performance and makes it more relevant to manual discrimination. The calculation formula is as follows:

$$\begin{aligned} METEOR=(1-Pen) \times F_{\text{ m }}, \end{aligned}$$

(35)

$$\begin{aligned} F_{\text{ m }}=\frac{P R}{\alpha P+(1-\alpha ) R}, \end{aligned}$$

(36)

$$\begin{aligned} P=\frac{m}{\sum _{k} \text{ h}_{k}(c_{i})}, \end{aligned}$$

(37)

$$\begin{aligned} R=\frac{m}{\sum _{k} \text{ h}_{k}(s_{ij})}, \end{aligned}$$

(38)

where $Pen=\gamma (\frac{ch}{m})^{\theta }$ is a penalty factor, which punishes the word order in candidate translation that is different from that in reference translation. ch refers to the number of chunks, which are clustered units of matched units adjacent to each other in both the candidate translation and the candidate reference translation. $\alpha , \beta , \theta$ is the adjustable parameter, m is the number of unary groups that can be matched in the candidate translation, c is the length of the candidate translation, $h_{k}(c_{i})$ is the number of occurrences in candidate translations $c_{i}$, and $h_{k}(s_{ij})$ is the number of occurrences in reference translations $s_{ij}$.

Perplexity Perplexity is also called the degree of confusion [403]. Its core idea is: first, according to the testing sentence, learn an LM P. Then, according to the LM P, the score of the optional sentence is calculated. Finally, the above scores are standardized according to sentence length. The calculation formula is as follows:

$$\begin{aligned} P P L(W)=P\left( w_{1}, w_{2}, \ldots , w_{M}\right) ^{-\frac{1}{M}}, \end{aligned}$$

(39)

where W is the candidate translation, M is the length of the candidate translation, P is the LM obtained according to the reference translation, and $P(w_{1}, w_{2}, \ldots , w_{M})$ is the score calculated by the LM for the candidate translation. The Perplexity assessment indicator is based on a LM. The lower the degree of confusion, the better the translation quality, which is often used in machine translation and LMs. Its disadvantages are as follows: the larger the data set is, the faster the degree of confusion decreases; the punctuation in the data will impact the PPL of the model; and the interference of common words.

F Datasets

1.1 F.1 Downstream tasks and datasets on NLP

There are many available data sets in the NLP domain, divided according to different tasks. It mainly comprises two categories: the task of classification of texts and the task of generating texts. The text classification tasks mainly include Sentiment Analysis (SA), News Classification (NC), Topic Labelling (TL), Natural Language Inference (NLI), Named Entity Recognition (NER), Question Answering (QA), Dialogue Act Classification (DAC), etc. The generation tasks mainly include text summaries and machine translation (Table 5).

Table 5 The statistics of the datasets on NLP. For the QA task, the class represents the sum number of candidate answers and the correct answer. For dialogue, class is the number of slots. Length means the average tokens in turn

Full size table

Sentiment analysis (SA)

It consists of judging the emotional polarity and dividing it into several classes. Depending on the granularity of sentiments, the SA is divided into three categories: dichotomy (positive and negative), trichotomy (positive, negative, and neutral), and multiple categories. Here we introduce several datasets in detail.

Stanford sentiment treebank (SST) [474] The dataset is an extension of MR [475]. SST-1 is a version of SST. It is divided into five categories and the number of training texts and testing texts is 8544 and 2210, respectively. It also consists of 20 average tokens. The SST-2 [409] contains 9613 movie reviews including 6920 training texts, 872 development texts, and 1821 testing texts.

Semantic textual similarity benchmark (STS-B) [476] It is used in semantic textual similarity tasks organized in the SemEval context between 2012 and 2017 [477]. It consists of text from image titles, news titles and forums. On a scale of 1 to 5, STS-B displays the semantic similarity of two sentences. It includes 5749 training sets, 1379 development sets, and 1377 testing sets.

Multi-perspective question answering (MPQA) [478, 479] This is an opinion dataset which has two categories. It contains 10,606 sentences from various news sources that have been manually annotated for opinions and other private states. It is worth noting that there are 3311 positive articles and 7293 negative articles, having no labels for each article.

IMDB reviews [480] The dataset is the world’s most authoritative source for binary sentiment classification of film reviews. The number of content in each class is the same and it can be divided into training and testing sets whose number of comments is 25,000 on average.

News classification (NC)

As one of the most vital information sources, news content exerts a critical effect on people. The NC facilitates users to acquire essential knowledge in real time. Its applications mainly include news topic identification and recommendation of relevant news based on user interests. Here we introduce several datasets in detail.

20 Newsgroups (20NG) [481] 20NG is a text dataset derived from newsgroups. There are 20 classes with the same number of articles per class, including 18,846 articles in total. The average number of tokens is 221.

AG news [424, 482] This is an academic news search engine, which is divided into four categories. It contains news headlines and introductions. It includes 120,000 training texts and 7600 testing texts. The number of average tokens is 45/7.

R8 and R52 [483] They come from Reuters [484]. R8 contains 8 classes consisting of 66 average tokens and includes 2189 and 5485 testing and training courses. There are 52 classes in R52, which consists of 70 average tokens. It is divided into 6532 and 2568 training and testing texts.

Topic labeling (TL)

The task mainly obtains the meaning of the file by defining complex file themes. It is a critical component of topic analysis technology, which aims at simplifying topic analysis by assigning each article to one or more topics. Here, we introduce a few in detail.

DBpedia [485] It is a large-scale multilingual knowledge base generated by Wikipedia’s most commonly used information boxes. It releases DBpedia every month, adding or removing classes and attributes in each version. The most popular version of DBpedia has 14 categories, separated into 560,000 training data and 70,000 testing data. The number of average tokens is 55.

Ohsumed [486] This is a biomedical literature database. The number of texts is 7400. It has 23 cardiovascular disease categories and consists of 136 average tokens. All texts are medical abstracts that are categorized into one or more classes.

Yahoo answers (YahooA) [424] The dataset is a topic labeling task having 10 categories. The number of average tokens is 136. There are 140,000 training data and 5000 testing data. Each text in YahooA has question titles, question contexts, and best answers.

Natural language inference (NLI)

This task is used to forecast whether the meaning of a text can be inferred from another. Interpretation is a broad form of NLI. By comparing the semantic similarity of sentence pairings, it determines whether a sentence is the interpretation of another one. Here we introduce several primary datasets in detail.

The Stanford natural language inference (SNLI) [430] It is commonly used in NLI takes. It contains 570,152 human-annotated sentence pairs, which are annotated with three sorts of relationships: neutral, derived, and conflicting. Multi-genre Natural Language Inference (MNLI) [487] has 3 categories and consists of 430,000 sentence pairs annotated with textual information, which is usually used in textual inference tasks. Question Natural Language Inference (QNLI) [488], whose task with 2 classes is to determine whether a given text pair is a question-answer. Winograd Natural Language Inference (WNLI) [489] which consists of 2 categories is a data set that captures the standard reference information between two paragraphs.

Microsoft research paraphrase (MSRP) [435] The dataset contains sentence pairs for the text-similarity task, including 1725 training and 4076 testing sets. A binary label annotates each pair, discriminating whether they are paraphrases.

Sentences involving compositional knowledge (SICK) [434] It includes nearly 10,000 English sentence pairs, marked with similarity, and the scale range is 1–5. It has neutral, entailment, and contradictory three categories.

Named entity recognition (NER)

This is a fundamental task of NLP to identify people, places, organizations, and other entities in text. It is a crucial primary tool for many NLP tasks, including information extraction, question answering, semantic parsing, machine translation, etc.

CoNLL 2003 [303] It consists of newswire text from the Reuters RCV1 corpus. It contains four different entity types (Location, Organization, Person, and Miscellaneous) and includes 1393 English news articles, and 909 German news articles.

OntoNotes 5.0 [13] The dataset consists of 174,5K English, 900K Chinese, and 300K Arabic text data. It comes from telephone conversations, news agencies, radio news, radio conversations, and blogs. It has 18 entity classes containing 11 types, seven values, and 2,945,000 text data.

MSRA [440] This is a Chinese dataset that is obtained from the news domain. It has three types of entities and is used as a shared task on SIGNAN back in 2006.

Question answering (QA)

There are two types of QA systems: the extraction guidance system and the generation guidance system. The extractive QA can be regarded as a particular case of text classification. Here we detail several datasets.

Microsoft research paraphrase corpus (MRPC) [490] It contains 5800 sentence pairs extracted from Internet news, and the task type is similar to the QQP dataset. Sentence pairs are derived from comments on the same news item and determine whether the two sentences are semantically the same. The assessment criteria were classification accuracy and F1 score.

Stanford question answering dataset (SQuAD) [303] This is a large-scale machine-reading comprehension dataset that contains two tasks. SQuAD 1.1 [488] provides questions and corresponding answers, and the data set contains 100,000 samples in total, while SQuAD 2.0 [491] adds unanswered questions and expands the scale to 150,000.

RACE [492] The dataset has 5 categories, containing nearly 100,000 questions extracted from middle and high school English tests, with corresponding answers given by experts. The average length of RACE text is more significant than 300, which is longer than other reading comprehension data sets (such as SQuAD) sequences.

Dialog act classification (DAC)

The dialogue act is a specific verbal component, which marks the dialogue according to the meaning category of the dialogue. DAC categorizes tags according to the meaning of the dialogue to help understand the speaker’s intentions.

Dialog state tracking challenge 4 (DSTC 4) [451] It belongs to the dialog act classification task and mainly focuses on dialog state tracking on human-human dialogs. It is divided into 89 training classes and contains 24,000 training texts and 6000 test texts.

ICSI meeting recorder dialog act (MRDA) [452] It includes about 75 h of speech from 75 naturally occurring meetings among 53 speakers. The number of categories is 5, and it contains 51,000 training texts, 11,000 test texts, and 11,000 validation texts.

Switchboard dialog act (SwDA) [493] The dataset extends the dialogue behavior label with rounds/discourses. The label summarizes the sentence structure, and relevant and pragmatic information of the relevant turn. The SwDA is split into 43 training classes and includes 1,003,000 training texts, 19,000 test texts, and 112,000 validation texts.

Text summarization

Text summarization is a summary of a given single or multiple documents. It is kept as concise as possible while ensuring that it reflects the critical content of the original document. It can be divided into extractive summarization and generative summarization. Extractive summarization is generated by extracting and splicing the critical sentences in documents. Generative summarization is generated by a model, which summarizes documents according to the required content expressed in documents.

NYT [455] The dataset comes from the corpus annotated by the New York Time. The named entities are annotated using the Stanford NER tool in conjunction with the Freebase knowledge base. It contains 9076 articles, with the remaining 100,834 divided into a training set (96,834 examples) and a validation set (4000 samples).

CNN/daily mail [457] It is used for the passage-based question-answering task, and it is popular in assessing ATS systems. The dataset consists of CNN/Daily Mail news stories paired with multi-sentence human-generated summaries. There are 287,226 training instances, 13,368 validation instances, and 11,490 testing instances in total.

Gigaword [464] This is a data set of English news chapters consisting of nearly 950 pieces. Headlines – stories from multiple sources, including the New York Times – include some articles with a one-sentence, short news feed.

Machine translation (MT)

It refers to the task of translation from one language to another with its semantic equivalence by a computer. There are three categories, rule-based machine translation, statistics-based machine translation, and neural network-based machine translation.

WMT14 [465] It is a grouping of datasets used in the Ninth Workshop on Statistical Machine Translation shared tasks, including a news translation task, a quality estimation task, a metrics task, and a medical text translation task.

WMT16 [466] This dataset is a grouping of datasets used in the First Conference on Machine Translation shared tasks. It has ten shared tasks, including a news translation task, an IT domain translation task, a biomedical translation task, an automatic post-editing task, a metrics task, a quality estimation task, a tuning task, a pronoun translation task, a bilingual document alignment task, and a multimodal translation task.

WMT17 [465] The dataset includes three MT tasks (news, biomedical, and multimodal), an automatic post-editing task, a quality estimation task, a task dedicated to the training of neural MT systems, a task on bandit learning for MT, an automatic post-editing task, and a metrics task.

WMT18 [468] It mainly features six shared tasks: a news translation task, a biomedical translation task, an automatic post-editing task, a metrics task, a quality estimation task, and a multimodal translation task. Participants must evaluate their approaches to the machine translation topic using the standard data sets created for the shared tasks.

Dialogue

As an essential way of man–machine interaction, the dialogue system offers a wide range of applications. The existing dialogue systems can be grouped into task-oriented dialogue systems and non-task-oriented dialogue systems from application scenarios. Among them, the non-task type of conversation system can also be called a chatbot.

DSTC2 [471] This is a multi-round dialogue data set of restaurant reservation fields, including 1612 training data, 506 verification data, and 1117 test data. It allows the user’s goals to change compared to DSTC1. DSTC2 is also richer in terms of the conversation state representation, including the slot value pairs of the user’s targets and the ways to find them.

MWOZ [471] It contains 8420/1000/1000 conversations for training, validation, and test sets, respectively. It contains 30 pairs in seven domains being a multi-domain fully-labeled corpus. Every sample includes a goal, multiple user and agent utterances, and annotations regarding slot values.

Out-of-scope (OOS) [471] The dataset includes 15,100 training, 3100 validation, and 5500 test sets, respectively. It contains 151 intent classes, containing 150 in-scope and one out-of-scope intent. The out-of-scope intent indicates that a user utterance failed to classify to given predefined objectives.

1.2 Downstream tasks and datasets on CV

Table 6 The statistics of the datasets used on downstream tasks

Full size table

The datasets in CV mainly contain three types from the perspective of tasks: classification, detection, and segmentation. The popular datasets are concluded in Table 6, and some infrequently mentioned datasets in long tails are discussed in the text.

Classification

In this part, we first cover the popular large-scale datasets used frequently in both the pretext and downstream tasks. Then the domain datasets only used for the downstream tasks are unfolded.

MNIST [497] It’s a collection of handwritten digits that includes 60, 000 samples in training and 10, 000 in testing. The images are fixed-size with $28\times 28$ pixels. The pixel values are from 0 to 255.0 in which pixel values smaller than 255.0 can be understood as background (white) and 255 means foreground (black). The labels are from 0 to 9 and only one of these digits exists in an image. Both traditional and deep learning methods are based on this most popular dataset despite advanced methods showing perfect results. Thus, Geoffrey Hinton has described it as “the drosophila of machine learning”.

Street view house numbers (SVHN) [498] In the domain of digit numbers, it collects real-world digit numbers from house numbers in Google Street View images. It includes 73, 257 digits for training, 26, 032 digits for testing, and 531, 131 additional. All of them are $32\times 32$ color images with both class labels and character-level bounding boxes.

CIFAR [499] As more advanced methods show perfect results on the simple datasets, more sophisticated datasets such as CIFAR-10 and CIFAR-100 are conducted. These two datasets are closer to the real-world object. The CIFAR-10 contains 50,000 training images and 10,000 testing images, with 6000 images per class and $32\times 32$ pixels in each RGB color image. The CIFAR-100 is similar to the CIFAR-10 but with more detailed label information. There are 100 classes containing 500 training images and 100 testing images in each class. In addition, these 100 “fine” classes are grouped equally into 20 “coarse” classes. Researchers can adapt it to suitable learning methods.

STL-10 [500] Inspired by the CIFAR-10 dataset, STL-10 is another $96\times 96$ color image dataset containing similar 10 real-world classes. Each class has 500 training images and 800 testing images. The biggest difference is that STL-10 has 100, 000 unlabeled images for unsupervised learning. More construction information can be seen in [501].

Caltech-101 [502] It collects roughly $300\times 200$ color images of objects belonging to 101 categories, with 40–800 images per category and 50 on average. The outlines of the objects in the pictures are annotated for the convenience of different learning methods.

ImageNet [312] This is one of the most popular and large-scale datasets on computer vision. It is built according to the hierarchical structure of WordNet [503]. The full ImageNet dataset contains 14, 197, 122 images and 21, 841 synsets indexed, attaching on average 1000 images to demonstrate each synset. The most frequently-used subset of ImageNet is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset from 2010 to 2017, containing tasks of classification, localization, and detection. The number of samples in training and testing datasets and the labels of images are determined by the specific task, more details are seen in [504].

HMDB51 [505, 506] In addition to the popular MNIST, there still exist many domain datasets used for the downstream tasks in the classification problem. HMDB51 is an action video database for a total of 7000 clips in 51 action classes. It contains five types of facial actions and body movements.

UCF101 [507] It is another action video data set designed for more realistic action recognition. It is an extension of the UCF50 [508] data set containing only 50 action categories with 101 action categories, collected from YouTube. What makes it a famous recognition dataset is the workshop in ICCV13 with UCF101 as its main competition benchmark.

Food-101 [509] This is a real-world food dataset of 101 food categories, with 750 and 250 images per class in training and testing dataset respectively.

Birdsnap [510] It is a fine-grained visual categorization of birds on a broad scale, with bounding boxes and the locations/annotations of 17 parts in the object. It contains 49,829 images of the 500 most common species in North America, with each species containing 69–100 images and most species having 100. In addition, some images are also labeled as male or female, immature or adult, and breeding or non-breeding plumage.

SUN397 To target the scene categorization, the extensive Scene UNderstanding (SUN) database [511, 512] fills the gap of the existing dataset with the limited scope of categories. This database has 899 categories and 130,519 images, and only images with more than $200\times 200$ pixels were kept. SUN397 is a more well-sampled subset that maintains 397 categories with at least 100 images per category, in which other categories containing relatively few unique photographs are discarded.

Places205 Places205 [513] dataset is another large scale scene dataset consists of 2,448,873 images from 205 scene categories.

Cars [514] The dataset in the domain of cars contains 16,185 color images of 196 classes (at the level of Make, Model, Year) of cars. For convenience, this dataset is split into training and testing sets in roughly equal quantities.

Aircraft [515] It is another fine-grained visual classification designed for aircraft (also known as FGVC-Aircraft). A popular form of this dataset is the fine-grained recognition challenge 2013 (FGComp2013) [516] ran in parallel with the ILSVRC2013. There exist four-level hierarchies: Model, Variant, Family, Manufacturer, from finer to coarser to organize this database. The more detailed information is shown in [517].

Pets [518] It represents The Oxford-IIIT Pet Dataset that collects 37 pet categories with roughly 200 images per category. All images have an associated ground truth annotation of breed for classification, head ROI for detection, and pixel-level trimap for segmentation.

Flowers [519] Similarly, Flowers is another domain dataset in flowers also collected by Oxford; it contains Oxford-17 Flowers of 17 categories and Oxford-102 Flowers of 102 categories.

Describable textures dataset (DTD) [520] This is an evolving collection of textural images in the wild, which consists of 5640 images of 47 categories, with 120 images per category.

iNaturalist2018 [521] It is a large-scale species classification competition conducted on the FGVC5 workshop at CVPR2018. This dataset contains over 8000 species categories, with more than 450, 000 images in the training and validation dataset collected from iNaturalist [522].

JFT-300 M [523] JFT-300 M is an internal Google dataset introduced by Sun et al [523] and well-known from ViT Model [40]. It is labeled by algorithms that utilize human-computer communications and target classification tasks. This dataset finally contains 300 M images with over 1000 M labels, thus leading to the multiple labels attached to this large-scale dataset.

Detection

The detection is a popular task in the CV, and almost all the research is conducted on COCO and PASCAL VOC datasets.

COCO [339] This is a large-scale data set for object detection, segmentation, and caption; it contains 330, 000 RGB images, with more than 200, 000 labeled. There are 1.5 million object instances of 80 object categories involved. Thus, it is one of the most popular benchmark data set in detection and segmentation in parallel with the following PASCAL VOC.

PASCAL VOC [524] From 2005 through 2012, the dataset has run challenges assessing performance on object class recognition and has provided standardized image datasets for object class recognition. The main datasets used in self-supervised learning are VOC07, VOC11, and VOC12. Main competitions in VOC07 [525] contain classification and detection tasks; both of them consist of 20 objects and contain at least one object in each image. Thus, it is common to use VOC07 to serve as the downstream task for the detection.

Segmentation

The segmentation is a semantics-based pixel-level classification. These datasets are difficult to obtain and annotate, thus they are always used as a downstream task.

VOC11 [526] & VOC12 [527] Both VOC11 and VOC12 contains classification, detection, and segmentation tasks in the main competition, thus leading to the common use of downstream task for the segmentation.

ADE20K [528, 529] It collects 27, 574 images from both the SUN and Places205 databases, in which 25, 574 for training and 2000 for testing. All 707, 868 objects from 3688 categories existing in images are annotated. Especially, this dataset contains 193, 238 annotated object parts and parts of parts, and additional attributes, annotation time, depth ordering for the benefit of the research community.

NYU-Depth V2 [530] This is a dataset consisting of images and video sequences from 464 indoor scenes that are recorded by both the RGB and Depth cameras from 3 cities. It contains 1449 images with the ground truth of depth, and the original RGB values are also provided. In addition, there are 407, 024 new unlabeled frames and additional class labels for the objects in images.

Cityscapes [531, 532] It is a dataset of urban street scenes from 50 cities with the ground truth of semantic segmentation. The main instances are vehicles, people, and construction. The high-quality dense pixel annotations contain a volume of 5000 images. In addition to the fine annotations, coarser polygonal annotations are provided for a set of 20, 000 images. Moreover, the videos consist of not consistent images with high-quality annotations, and these annotated images with consistently changing views are provided for researchers.

LVIS [533] It is a dataset for large vocabulary instance segmentation. It features that (1) a category or word in one image is related to the only segmentation object; (2) more than 1200 categories are extracted from roughly 160,000 images; (3) long tails phenomenon exist in these categories; and (4) more than 2,000,000 high-quality instance segmentation masks.

Densely annotated video segmentation (DAVIS) [534] It is a video dataset designed for the in-depth analysis of the SOTA in video object segmentation, in which DAVIS 2017 [535] contains both semi-supervised (human-guided at the testing time) and unsupervised (human non-guided at test time) video sequences with multiple annotated instances.

Others

There are many datasets designed for special visual tasks such as inpainting. In addition, this part covers the data collection in the wild.

Paris StreetView [536] The dataset is designed for image inpainting task, which contains 14, 900 training images and 100 testing images collected from street views of Paris. This dataset is collected from Google Street View and mainly focuses on the buildings in the city.

Moving-MNIST [537] Based on MNIST, it is a video dataset designed for evaluating sequence prediction or reconstruction, which contains 10,000 sequences. Each video is long of 20 frames and consisted of two digits (possibly overlapped) moving inside a $64\times 64$ patch. The first benchmark is reported on [538] by the method of LSTMs.

Yahoo Flickr creative commons 100 million (YFCC100M) [539, 540] The dataset is the largest public multimedia collection that is allowed to search by users for their own targets; this dataset can browse both images and videos. It is free and for researchers to explore and investigate subsets of the YFCC100M in real time. Subsets of the complete dataset can be retrieved by any keyword search and reviewed directly. In addition, the text information attached to any image or video is abundant, such as containing location information and user tags. Briefly, it is more a multimedia library than a domain dataset.

Data in the wild More generalized dataset concept in the self-supervised learning era is composed of multimedia websites, APP, or search engines such as Instagram, Flickr, Google Images, etc. I think pictures in the wild will play a major role in the future study of CV because of the quantity of data, the computation source, and the learning power of PFM.

1.3 F.3 Downstream tasks and datasets on graph

The purpose of the pretraining graph model is to improve the performance of downstream tasks. According to the different analysis objects of the downstream tasks, they can be divided into nodes, edges, and graphs. Meanwhile, the PFMs of GL have been widely used in a mass of fields. In this section, we combine the downstream tasks to conduct statistics on the pretraining datasets and the downstream task datasets.

Node-level tasks

Nodes are the most basic element of the graph, so lots of downstream tasks mainly focus on the analysis of nodes.

Node classification Node ClassiFication (NCF) is one of the most prevalent graph-based tasks, which has important analytical value in most of the different types of graph data. Different from the pseudo-labels assigned to nodes in the graph in self-supervised methods, the labels in NCF often come from external information such as manual annotation. Based on Definition 7 and 8, NCF can be divided into two types: transductive and inductive according to the visibility during training, verification, and testing. In addition, the result of NCF can be single-label or multi-label according to the mutual exclusion of labels. The statistical results of common NFC datasets are shown in Table 7.

Table 7 The statistics of the datasets for node-level tasks. Homogeneous:Hom, Heterogeneous:Het

Full size table

Node clustering The goal of Node ClusterIng (NCI) is to divide a graph into different classes or clusters according to a certain standard so that the correlation of nodes in the same cluster is as large as possible, and the irrelevance of nodes that are not in the same cluster is also minimized. Although in the above-mentioned pretraining tasks, NCI is used as a pretext task has appeared, NCI can still test pretraining graph models based on other pretext tasks.

Top-K search The goal of task Top-K Search (TKS) is to search the K nodes with the highest predefined associations for a given node in the graph. Usually, TKS is used for search tasks such as recommendation and alignment. The detailed statistical results of the datasets are shown in Table 7.

Link-level tasks

The edge is also an important part of the graph structure, which associates independent nodes and is the key to distinguishing graph data from non-relational data. Especially in some specific fields (e.g., molecules, proteins), edges contain real information, so there are various tasks related to edges.

Link classification Similar to the NCF, the Link Classification (LC) also assigns one or more labels to a given edge. In fact, in LC, the nodes at both ends of the edge are still taken into consideration.

Link prediction Link Prediction (LP) is a common graph task (e.g., knowledge graph). The goal of LP is to predict edges that are removed or may exist in the graph. Similar to NCI, LP is also one of the pretext tasks in self-supervised learning, and its statistic results as shown in Table 8.

Table 8 The statistics of the datasets for LC. Homogeneous:Hom, Heterogeneous:Het

Full size table

Top-K recommendation Top-K Recommendation (TKR) is exactly the same as the definition of TKS, the difference lies in the sorting goal.

Graph-level tasks

The graph-level task generally focuses on the distribution of nodes, edges, and attributes in a given graph, in order to infer the possible properties of the entire graph.

Graph classification Graph Classification (GC) is commonly used in social, molecular, and protein graph data, which aims to predict the property of the given community, chemical compound, and protein. The statistic results as shown in Table 9.

Table 9 The statistics of the datasets for GC. Homogeneous:Hom, Heterogeneous:Het

Full size table

Data source

The PFMs of GL have been widely used in a mass of fields. We will descript the details of the pretraining datasets and the downstream task datasets.

Citation and co-author network A citation is a basic local representation, whose structure reflects the citation relationships of papers in a research direction or field. Specifically, a citation network is a kind of relational data composed of research papers as nodes and citation relations as edges. Among them, the citation network used in the GL model usually comes from local samples of common citation databases, e.g., Cora, Citeseer, and PubMed, and serves as downstream tasks. Similarly, the co-author network is a dataset of scientific collaboration that corresponds to a researcher’s ego network, in which the researcher and their collaborators are nodes and an edge indicates collaboration between two researchers. According to different requirements of downstream tasks, such co-author networks can be used for various tasks, e.g., node classification and graph classification.

Molecular and protein network A molecular network usually refers to a compound composed of atoms and atomic bonds, and predicting the properties of the compound is usually regarded as a graph classification task. For example, MUTAG is a collection of nitroaromatic compounds whose goal is to predict their mutagenicity to Salmonella typhimurium. PTC uses a graph to show the structure of multiple compounds and aims to predict the carcinogenicity of different compounds in rats. The protein network is a collection of proteins classified as either enzymes or non-enzymes. The amino acids are represented by nodes, and two nodes are connected by an edge if they are less than 6 Angstroms apart.

Social and movie network The social network is the social-relational data in the real network environment, which usually represents the relationship between users or posts. For instance, Reddit is a graph dataset comprised of Reddit posts made in September 2014. BlogCatalog is a graph dataset that represents a network of social relationships between bloggers who are listed on the BlogCatalog website. The movie network is usually composed of actors and their co-occurrence participation in the movie. For example, IMDB-B is a movie collaboration data set that contains a large number of self-networks of actors who play movie roles in IMDB. Nodes in each graph represent actors/actresses, and if they appear in the same film, an edge connects them. These graphs are based on action and romance genres. The difference between IMDB-M and IMDB-B is that a node in the graph represents one or more actors.

Others Some of the rarer graph data are used to test the universality of the PFM, such as word networks (Wikipedia), book networks (Book-crossing), and airline networks (US-Airport). In addition, there are also some special graph structures adapted to specific models, such as spatiotemporal graphs (METR-LA).

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Zhou, C., Li, Q., Li, C. et al. A comprehensive survey on pretrained foundation models: a history from BERT to ChatGPT. Int. J. Mach. Learn. & Cyber. (2024). https://doi.org/10.1007/s13042-024-02443-6

Download citation

Received: 06 October 2024
Accepted: 19 October 2024
Published: 24 November 2024
DOI: https://doi.org/10.1007/s13042-024-02443-6

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Transformers in the Service of Description Logic-Based Contexts

Enhancing Unsupervised Pretraining with External Knowledge for Natural Language Inference

Large language models (LLMs): survey, technical frameworks, and future challenges

Explore related subjects

Data availability

Notes

References

Author information

Authors and Affiliations

Corresponding authors

Additional information

Publisher's Note

Appendices

Appendix A Basic components

1.1 A.1 Basic components on NLP

1.1.1 A.1.1 Language model

1.2 A.2 Basic components on GL

1.2.1 A.2.1 Notations and definitions of graphs

Definition 1

Definition 2

Definition 3

Definition 4

Definition 5

Definition 6

1.2.2 A.2.2 Learning settings on graphs

Definition 7

Definition 8

B Traditional learning methods

1.1 B.1 Traditional text learning

1.2 B.2 Traditional image learning

1.2.1 B.2.1 Convolution-based networks

1.2.2 B.2.2 Recurrent neural networks

1.2.3 B.2.3 Generation-based networks

1.2.4 B.2.4 Attention-based networks

1.2.5 B.2.5 Transformer-based networks

1.3 B.3 Traditional graph learning

C PFMs theory

1.1 C.1 Different perspectives

1.2 C.2 Different categories

D Pretext task taxonomy on CV

E Evaluation metrics

F Datasets

1.1 F.1 Downstream tasks and datasets on NLP

1.2 Downstream tasks and datasets on CV

1.3 F.3 Downstream tasks and datasets on graph

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now