Survey of Neural Text Representation Models
Figure 1. Word2Vec architecture. In addition to an input layer and an output layer, shallow architectures have a small number of hidden layers (one in this case).

Figure 2. Recurrent architecture and its unfolding in time. Recurrent nodes have connections that lead to the next node and connections that loop back to the same node. The unfolding in time shows the same node in three consecutive iterations.

Figure 3. The LSTM (long short-term memory) cell. Variables x_t and h_t are the input and output, respectively, at time t. Squares labeled "σ" or "tanh" represent layers, whereas ovals labeled "×", "+", or "tanh" represent pointwise operations.

Figure 4. A parse tree of a recursive neural network predicting word sentiment classes. The leaf nodes are input tokens; all other nodes represent combinations of their child nodes. The root node is a representation of the entire input text.

Figure 5. Convolutional architecture. A convolution has multiple filters, and each filter has a kernel (a matrix of weights) that is being trained. The kernel slides over the values from the previous layer, producing values that are sent to the next layer. Each filter learns to recognize a different pattern.

Figure 6. A visualization of a learned self-attention head on a sentence, showing the relations between words that this head has learned. Each head learns a different kind of relation between words.

Figure A1. The average number of tasks models are evaluated on, by year.

Figure A2. Number of models categorized by main architecture and year. Values are normalized by rows; higher values are darker.
Abstract
1. Introduction
2. Evaluation Datasets
- WordSim-353 [16] contains pairs of words with their similarity and relatedness scores (a sketch of the evaluation protocol follows this list).
- SICK (Sentences Involving Compositional Knowledge) [17] consists of English sentence pairs that were annotated for relatedness and entailment.
- SNLI (Stanford Natural Language Inference) [18] contains sentence pairs manually labeled for entailment classification.
- MultiNLI (Multi-Genre Natural Language Inference) [19] consists of sentence pairs annotated with textual entailment information.
- MRPC (Microsoft Research Paraphrase Corpus) [20] has pairs of sentences along with human labels indicating whether each pair is in a paraphrase/semantic-equivalence relationship.
- STS (Semantic Text Similarity) [21] consists of pairs of sentences with labels annotating their semantic similarity.
- SSWR (Semantic-Syntactic Word Relationship) [22] contains semantic and syntactic word relations (e.g., “Einstein: scientist”, “Mozart: violinist”, “think: thinks”, “say: says”).
- MSR (Measuring Semantic Relatedness) [23] is similar to the SSWR dataset.
- SST (Stanford Sentiment Treebank) [24] has parsing trees with a sentiment classification label for each node.
- IMDB (Internet Movie Database) [25] contains reviews of movies and their sentiment polarity.
- MR (Movie Reviews) [26] is similar to the IMDB dataset in that it also contains opinions on movies.
- CR (Customer Reviews) [27] consists of customer reviews about products and their rating.
- SemEval’17 [28] is a dataset from the International Workshop on Semantic Evaluation. It consists of posts from Twitter with sentiment labels.
- TREC (Text REtrieval Conference) [29] is a dataset for question classification with questions divided into semantic categories.
- SQuAD (Stanford Question Answering Dataset) [30] consists of questions posed by humans on a set of Wikipedia articles, where the answer to every question is a segment of text.
- MPQA (Multi-Perspective Question Answering) [31] contains news articles annotated for opinions, beliefs, emotions, sentiments, speculations, and other private states.
- NQ (NaturalQuestions-Open) [32] contains Google queries and their answers.
- WQ (WebQuestions) [33] is a collection of questions found through the Google Suggest API, which are manually answered.
- WSJ (Wall Street Journal) consists of Wall Street Journal stories from Penn Treebank [34].
- AG News [35] contains news categorized into four classes (business, sci/tech, world, and entertainment). This version of the dataset contains only titles and disregards the rest of the articles.
- GLUE (General Language Understanding Evaluation) [36] is a collection of datasets for natural language understanding systems.
- RACE (ReAding Comprehension dataset from Examinations) [37] is a dataset for the evaluation of methods in the reading comprehension task, consisting of English exams for Chinese students.
- SUBJ (SUBJectivity classification) [38] contains labeled subjective and objective phrases.
- WMT (Workshop on statistical Machine Translation) is a family of datasets used for the machine translation tasks of the WMT workshops.
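Many of the word-level datasets above share a common evaluation protocol: embedding similarity is compared against human judgments. As a minimal sketch (not tied to any surveyed model), word-similarity evaluation on WordSim-353-style pairs might look as follows; the `embeddings` dictionary is a hypothetical placeholder, while `scipy.stats.spearmanr` computes the rank correlation that is conventionally reported.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_similarity_score(pairs, embeddings):
    """pairs: (word1, word2, human_score) triples, e.g., from WordSim-353.
    embeddings: a hypothetical dict mapping words to vectors.
    Returns Spearman's rho between human scores and cosine similarities."""
    human, model = [], []
    for w1, w2, score in pairs:
        if w1 in embeddings and w2 in embeddings:  # skip out-of-vocabulary pairs
            human.append(score)
            model.append(cosine(embeddings[w1], embeddings[w2]))
    rho, _ = spearmanr(human, model)
    return rho

# Usage (the pair and score here are illustrative placeholders):
# rho = word_similarity_score([("tiger", "cat", 7.0)], embeddings)
```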
3. Evaluation Tasks
4. Model Categorization
4.1. Representation Level
4.2. Input Level
4.3. Model Type
4.4. Model Architecture
4.5. Model Supervision
5. Shallow Models
Model | Input | Representation |
---|---|---|
Huang et al. [40] | word | word |
Word2Vec [22,47] | word | word |
Deps [60] | word | word |
GloVe [61] | word | word |
Doc2Vec [62] | word | sentence+ |
FastText [49] | n-grams | word |
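For context, shallow models of this kind can be trained with off-the-shelf tooling. A minimal sketch using Gensim (our illustration; the surveyed papers do not prescribe this library, and the API shown assumes Gensim ≥ 4.0):

```python
from gensim.models import Word2Vec, FastText

# A toy corpus: each sentence is a list of tokens.
corpus = [
    ["neural", "text", "representation"],
    ["word", "embeddings", "capture", "meaning"],
]

# Skip-gram Word2Vec [22,47]: word-level input, word-level representation.
w2v = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

# FastText [49]: character n-gram input (the Input column above).
ft = FastText(corpus, vector_size=100, window=5, min_count=1, min_n=3, max_n=6)

print(w2v.wv["neural"][:5])          # a trained word vector
print(ft.wv["representations"][:5])  # composed from n-grams even if unseen
```

FastText's n-gram input is what lets it produce a vector for a word never seen during training, as the last line illustrates.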
6. Recurrent Models
7. Recursive Models
8. Convolutional Models
9. Attention Models
10. Discussion
10.1. Comparison
10.2. Computational Complexity
10.3. Evaluation
10.4. Challenges
10.5. Future Research
11. Conclusions
Author Contributions
Funding
Conflicts of Interest
Appendix A
Dataset | Models |
---|---|
WordSim-353 | Huang et al. [40], GloVe [61], FastText [49] |
SICK | Skip-Thoughts [64], FastSent [66], RL-SPINN [76], DiSAN [91] |
SNLI | CoVe [67], Kim et al. [89], Lin et al. [90], DiSAN [91], ReSAN [93], Star-Transformer [96] |
MultiNLI | ALBERT [100] |
MRPC | Skip-Thoughts [64], FastSent [66], ELECTRA [102] |
STS | FastSent [66], ELECTRA [102] |
SSWR | Word2Vec [22,47], GloVe [61], FastText [49] |
MSR | Word2Vec [22,47] |
SST | Doc2Vec [62], RCNN [65], CoVe [67], ELMo [41], Tree-LSTM [75], RL-SPINN [76], ST-Gumbel [77], DCNN [82], BERT [54], Bi-BloSAN [92], ELECTRA [102]
IMDB | Doc2Vec [62], CoVe [67], HAN [88] |
MR | FastSent [66], RAE [72], MV-RNN [73], AdaSent [74], CNN-LSTM [84], CDWE [43] |
CR | FastSent [66], CNN-LSTM [84], DiSAN [91], Bi-BloSAN [92] |
SemEval’17 | XLM [95] |
TREC | FastSent [66], CoVe [67], CNN-LSTM [84], CDWE [43], DiSAN [91], Transformer-XL [97] |
SQuAD | CoVe [67], ALBERT [100] |
MPQA | FastSent [66], RAE [72], AdaSent [74], CNN-LSTM [84], DiSAN [91], Bi-BloSAN [92] |
NQ | REALM [7] |
WQ | REALM [7] |
WSJ | DIORA [78] |
AG News | CDWE [43] |
GLUE | BERT [54], SpanBERT [101] |
RACE | ALBERT [100] |
SUBJ | FastSent [66], CNN-LSTM [84], DiSAN [91], Bi-BloSAN [92] |
WMT | Cho et al. [63], Seq2Seq [5], ByteNet [83], Transformer [51], MASS [98] |
The tasks the surveyed models are evaluated on, with the number of models evaluated on each task:

Task | Number of Models
---|---
Word Similarity | 5
Word Analogy | 3
Sentiment Classification | 21
Sequence Labeling | 5
Sentence+ Similarity | 17
Paraphrase Detection | 10
Machine Translation | 9
Natural Language Inference | 19
Question Classification | 6
Subjectivity | 4
Question Answering | 11
Text Retrieval | 3
Coreference Resolution | 2
Reading Comprehension | 5
Summarization | 2
Winograd Schema Challenge | 3
Long-Range Dependencies | 1
Relation Extraction | 1
Perplexity | 3
BBC | 1
Classification | 13
Other | 9

The number of evaluation tasks per model, by year:

Model | Year | Number of Tasks
---|---|---
C&W [81] | 2008 | 2
RAE [72] | 2011 | 1
Huang et al. [40] | 2012 | 1
MV-RNN [73] | 2012 | 1
Word2Vec [22,47] | 2013 | 1
Deps [60] | 2014 | 1
GloVe [61] | 2014 | 3
DCNN [82] | 2014 | 2
Doc2Vec [62] | 2014 | 1
Cho et al. [63] | 2014 | 1
Seq2Seq [5] | 2014 | 1
Skip-Thoughts [64] | 2015 | 3
RCNN [65] | 2015 | 1
AdaSent [74] | 2015 | 3
Tree-LSTM [75] | 2015 | 2
CharCNN [48] | 2016 | 1
HAN [88] | 2016 | 2
FastSent [66] | 2016 | 5
RL-SPINN [76] | 2016 | 3
CoVe [67] | 2017 | 4
Kim et al. [89] | 2017 | 4
Lin et al. [90] | 2017 | 3
ByteNet [83] | 2017 | 2
Transformer [51] | 2017 | 1
CNN-LSTM [84] | 2017 | 4
FastText [49] | 2017 | 2
Akbik et al. [42] | 2018 | 1
ST-Gumbel [77] | 2018 | 2
ELMo [41] | 2018 | 6
GPT [44] | 2018 | 5
DiSAN [91] | 2018 | 5
Subramanian et al. [68] | 2018 | 5
Bi-BloSAN [92] | 2018 | 4
ReSAN [93] | 2018 | 2
BERT [54] | 2018 | 6
Liu et al. [94] | 2018 | 2
GPT-2 [3] | 2019 | 6
XLM [95] | 2019 | 3
Star-Transformer [96] | 2019 | 4
LASER [69] | 2019 | 3
DIORA [78] | 2019 | 2
Transformer-XL [97] | 2019 | 2
MASS [98] | 2019 | 2
SBERT [99] | 2019 | 6
CDWE [43] | 2020 | 4
XLNet [45] | 2020 | 8
ALBERT [100] | 2020 | 7
SpanBERT [101] | 2020 | 7
REALM [7] | 2020 | 1
ELECTRA [102] | 2020 | 6
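The per-model task counts in the table above are what Figure A1 aggregates. A minimal sketch of that aggregation, assuming pandas and using a few illustrative rows from the table:

```python
import pandas as pd

# A few (model, year, number-of-tasks) rows from the table above.
rows = [
    ("C&W [81]", 2008, 2),
    ("ELMo [41]", 2018, 6),
    ("BERT [54]", 2018, 6),
    ("XLNet [45]", 2020, 8),
    ("ALBERT [100]", 2020, 7),
]
df = pd.DataFrame(rows, columns=["model", "year", "n_tasks"])

# Average number of evaluation tasks per model, by year (cf. Figure A1).
avg_tasks_by_year = df.groupby("year")["n_tasks"].mean()
print(avg_tasks_by_year)
```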
References
- Manning, C.D.; Raghavan, P.; Schütze, H. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008. [Google Scholar]
- Goldberg, Y. Neural network methods for natural language processing. Synth. Lect. Hum. Lang. Technol. 2017, 10, 1–309. [Google Scholar] [CrossRef]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. 2014. Available online: https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf (accessed on 29 October 2020).
- Jianqiang, Z.; Xiaolin, G.; Xuejun, Z. Deep Convolution Neural Networks for Twitter Sentiment Analysis. IEEE Access 2018, 6, 23253–23260. [Google Scholar] [CrossRef]
- Guu, K.; Lee, K.; Tung, Z.; Pasupat, P.; Chang, M.W. Realm: Retrieval-augmented language model pre-training. arXiv 2020, arXiv:2002.08909. [Google Scholar]
- Hailu, T.T.; Yu, J.; Fantaye, T.G. A Framework for Word Embedding Based Automatic Text Summarization and Evaluation. Information 2020, 11, 78. [Google Scholar] [CrossRef] [Green Version]
- Bodrunova, S.S.; Orekhov, A.V.; Blekanov, I.S.; Lyudkevich, N.S.; Tarasov, N.A. Topic Detection Based on Sentence Embeddings and Agglomerative Clustering with Markov Moment. Future Internet 2020, 12, 144. [Google Scholar] [CrossRef]
- Martinčić-Ipšić, S.; Miličić, T.; Todorovski, L. The Influence of Feature Representation of Text on the Performance of Document Classification. Appl. Sci. 2019, 9, 743. [Google Scholar] [CrossRef] [Green Version]
- Young, T.; Hazarika, D.; Poria, S.; Cambria, E. Recent trends in deep learning based natural language processing. IEEE Comput. Intell. Mag. 2018, 13, 55–75. [Google Scholar] [CrossRef]
- Jing, K.; Xu, J.; He, B. A Survey on Neural Network Language Models. arXiv 2019, arXiv:1906.03591. [Google Scholar]
- Camacho-Collados, J.; Pilehvar, M.T. From word to sense embeddings: A survey on vector representations of meaning. J. Artif. Intell. Res. 2018, 63, 743–788. [Google Scholar] [CrossRef] [Green Version]
- Ruder, S.; Vulić, I.; Søgaard, A. A survey of cross-lingual word embedding models. J. Artif. Intell. Res. 2019, 65, 569–631. [Google Scholar] [CrossRef] [Green Version]
- Aßenmacher, M.; Heumann, C. On the comparability of Pre-trained Language Models. arXiv 2020, arXiv:2001.00781. [Google Scholar]
- Finkelstein, L.; Gabrilovich, E.; Matias, Y.; Rivlin, E.; Solan, Z.; Wolfman, G.; Ruppin, E. Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web, Hong Kong, China, 1–5 May 2001; pp. 406–414. [Google Scholar]
- Marelli, M.; Menini, S.; Baroni, M.; Bentivogli, L.; Bernardi, R.; Zamparelli, R. A SICK Cure for the Evaluation of Compositional Distributional Semantic Models. In Proceedings of the 9th Language Resources and Evaluation Conference, Reykjavik, Iceland, 26–31 May 2014; pp. 216–223. Available online: http://www.lrec-conf.org/proceedings/lrec2014/pdf/363_Paper.pdf (accessed on 29 October 2020).
- Bowman, S.R.; Angeli, G.; Potts, C.; Manning, C.D. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Lisbon, Portugal, 17–21 September 2015. [Google Scholar]
- Williams, A.; Nangia, N.; Bowman, S. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1 (Long Papers)), New Orleans, LA, USA, 1–6 June 2018; pp. 1112–1122. [Google Scholar]
- Dolan, B.; Quirk, C.; Brockett, C. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland, 23–27 August 2004; p. 350. [Google Scholar]
- Agirre, E.; Banea, C.; Cardie, C.; Cer, D.; Diab, M.; Gonzalez-Agirre, A.; Guo, W.; Mihalcea, R.; Rigau, G.; Wiebe, J. Semeval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), Dublin, Ireland, 23–24 August 2014; pp. 81–91. [Google Scholar]
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
- Mikolov, T.; Yih, W.T.; Zweig, G. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, GA, USA, 9–14 June 2013; pp. 746–751. [Google Scholar]
- Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.; Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1631–1642. [Google Scholar]
- Maas, A.L.; Daly, R.E.; Pham, P.T.; Huang, D.; Ng, A.Y.; Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, Portland, OR, USA, 19–24 June 2011; pp. 142–150. [Google Scholar]
- Pang, B.; Lee, L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor, MI, USA, 17 June 2005; pp. 115–124. [Google Scholar]
- Hu, M.; Liu, B. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, 22 August 2004; pp. 168–177. [Google Scholar]
- Rosenthal, S.; Farra, N.; Nakov, P. SemEval-2017 Task 4: Sentiment Analysis in Twitter. In Proceedings of the SemEval ’17 11th International Workshop on Semantic Evaluation, Vancouver, BC, Canada, 3–4 August 2017. [Google Scholar]
- Voorhees, E.M.; Harman, D. Overview of TREC 2002. In Proceedings of the Eleventh Text REtrieval Conference, Gaithersburg, MD, USA, 19–22 November 2002; Available online: https://trec.nist.gov/pubs/trec11/papers/OVERVIEW.11.pdf (accessed on 29 October 2020).
- Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. Squad: 100,000+ questions for machine comprehension of text. arXiv 2016, arXiv:1606.05250. [Google Scholar]
- Wiebe, J.; Wilson, T.; Cardie, C. Annotating expressions of opinions and emotions in language. Lang. Resour. Eval. 2005, 39, 165–210. [Google Scholar] [CrossRef]
- Kwiatkowski, T.; Palomaki, J.; Redfield, O.; Collins, M.; Parikh, A.; Alberti, C.; Epstein, D.; Polosukhin, I.; Devlin, J.; Lee, K.; et al. Natural questions: A benchmark for question answering research. Trans. Assoc. Comput. Linguist. 2019, 7, 453–466. [Google Scholar] [CrossRef]
- Berant, J.; Chou, A.; Frostig, R.; Liang, P. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1533–1544. [Google Scholar]
- Marcus, M.; Santorini, B.; Marcinkiewicz, M.A. Building a large annotated corpus of English: The Penn Treebank. Comput. Linguist. 1993, 19, 313–330. [Google Scholar]
- Wang, J.; Wang, Z.; Zhang, D.; Yan, J. Combining Knowledge with Deep Convolutional Neural Networks for Short Text Classification. In Proceedings of the International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017; pp. 2915–2921. [Google Scholar]
- Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv 2018, arXiv:1804.07461. [Google Scholar]
- Lai, G.; Xie, Q.; Liu, H.; Yang, Y.; Hovy, E. RACE: Large-scale ReAding Comprehension Dataset From Examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017; pp. 785–794. [Google Scholar] [CrossRef] [Green Version]
- Pang, B.; Lee, L. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, 21–26 July 2004; p. 271. [Google Scholar]
- Jelinek, F. Statistical Methods for Speech Recognition; MIT Press: Cambridge, MA, USA, 1998. [Google Scholar]
- Huang, E.H.; Socher, R.; Manning, C.D.; Ng, A.Y. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, Jeju Island, Korea, 8–14 July 2012; pp. 873–882. [Google Scholar]
- Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep contextualized word representations. arXiv 2018, arXiv:1802.05365. [Google Scholar]
- Akbik, A.; Blythe, D.; Vollgraf, R. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 21–25 August 2018; pp. 1638–1649. [Google Scholar]
- Shuang, K.; Zhang, Z.; Loo, J.; Su, S. Convolution–deconvolution word embedding: An end-to-end multi-prototype fusion embedding method for natural language processing. Inf. Fusion 2020, 53, 112–122. [Google Scholar] [CrossRef]
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 29 October 2020).
- Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems; MIT Press: Vancouver, BC, Canada, 2019; pp. 5754–5764. [Google Scholar]
- Botha, J.; Blunsom, P. Compositional morphology for word representations and language modelling. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1899–1907. [Google Scholar]
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed Representations of Words and Phrases and Their Compositionality. 2013. Available online: https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf (accessed on 29 October 2020).
- Kim, Y.; Jernite, Y.; Sontag, D.; Rush, A.M. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016. [Google Scholar]
- Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 2017, 5, 135–146. [Google Scholar] [CrossRef] [Green Version]
- Gage, P. A new algorithm for data compression. C Users J. 1994, 12, 23–38. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. 2017. Available online: https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf (accessed on 29 October 2020).
- Kudo, T.; Richardson, J. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv 2018, arXiv:1808.06226. [Google Scholar]
- Schuster, M.; Nakajima, K. Japanese and Korean voice search. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 5149–5152. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Goldberg, Y. A primer on neural network models for natural language processing. J. Artif. Intell. Res. 2016, 57, 345–420. [Google Scholar] [CrossRef] [Green Version]
- Shi, T.; Liu, Z. Linking GloVe with word2vec. arXiv 2014, arXiv:1411.5595. [Google Scholar]
- Levy, O.; Goldberg, Y. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems; MIT Press: Montréal, QC, Canada, 2014; pp. 2177–2185. [Google Scholar]
- Harris, Z.S. Distributional structure. Word 1954, 10, 146–162. [Google Scholar] [CrossRef]
- Firth, J. A Synopsis of Linguistic Theory 1930–1955. In Studies in Linguistic Analysis; Palmer, F., Ed.; Philological Society: Oxford, UK, 1957. [Google Scholar]
- Levy, O.; Goldberg, Y. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Baltimore, MD, USA, 22–27 June 2014; pp. 302–308. [Google Scholar]
- Pennington, J.; Socher, R.; Manning, C. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
- Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1188–1196. [Google Scholar]
- Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]
- Kiros, R.; Zhu, Y.; Salakhutdinov, R.R.; Zemel, R.; Urtasun, R.; Torralba, A.; Fidler, S. Skip-thought vectors. In Advances in Neural Information Processing Systems; MIT Press: Montréal, QC, Canada, 2015; pp. 3294–3302. [Google Scholar]
- Lai, S.; Xu, L.; Liu, K.; Zhao, J. Recurrent convolutional neural networks for text classification. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015. [Google Scholar]
- Hill, F.; Cho, K.; Korhonen, A. Learning distributed representations of sentences from unlabelled data. arXiv 2016, arXiv:1602.03483. [Google Scholar]
- McCann, B.; Bradbury, J.; Xiong, C.; Socher, R. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems; MIT Press: Long Beach, CA, USA, 2017; pp. 6294–6305. [Google Scholar]
- Subramanian, S.; Trischler, A.; Bengio, Y.; Pal, C.J. Learning general purpose distributed sentence representations via large scale multi-task learning. arXiv 2018, arXiv:1804.00079. [Google Scholar]
- Artetxe, M.; Schwenk, H. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Trans. Assoc. Comput. Linguist. 2019, 7, 597–610. [Google Scholar] [CrossRef]
- Auli, M.; Galley, M.; Quirk, C.; Zweig, G. Joint language and translation modeling with recurrent neural networks. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1044–1054. [Google Scholar]
- Conneau, A.; Lample, G.; Rinott, R.; Williams, A.; Bowman, S.R.; Schwenk, H.; Stoyanov, V. XNLI: Evaluating cross-lingual sentence representations. arXiv 2018, arXiv:1809.05053. [Google Scholar]
- Socher, R.; Pennington, J.; Huang, E.H.; Ng, A.Y.; Manning, C.D. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Edinburgh, Scotland, UK, 27–31 July 2011; pp. 151–161. [Google Scholar]
- Socher, R.; Huval, B.; Manning, C.D.; Ng, A.Y. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, Jeju Island, Korea, 12–14 July 2012; pp. 1201–1211. [Google Scholar]
- Zhao, H.; Lu, Z.; Poupart, P. Self-adaptive hierarchical sentence model. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015. [Google Scholar]
- Tai, K.S.; Socher, R.; Manning, C.D. Improved semantic representations from tree-structured long short-term memory networks. arXiv 2015, arXiv:1503.00075. [Google Scholar]
- Yogatama, D.; Blunsom, P.; Dyer, C.; Grefenstette, E.; Ling, W. Learning to compose words into sentences with reinforcement learning. arXiv 2016, arXiv:1611.09100. [Google Scholar]
- Choi, J.; Yoo, K.M.; Lee, S.G. Learning to compose task-specific tree structures. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
- Drozdov, A.; Verga, P.; Yadav, M.; Iyyer, M.; McCallum, A. Unsupervised latent tree induction with deep inside-outside recursive autoencoders. arXiv 2019, arXiv:1904.02142. [Google Scholar]
- Nakagawa, T.; Inui, K.; Kurohashi, S. Dependency tree-based sentiment classification using CRFs with hidden variables. In Proceedings of the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, CA, USA, 2–4 June 2010; pp. 786–794. [Google Scholar]
- Bowman, S.R.; Gauthier, J.; Rastogi, A.; Gupta, R.; Manning, C.D.; Potts, C. A fast unified model for parsing and sentence understanding. arXiv 2016, arXiv:1603.06021. [Google Scholar]
- Collobert, R.; Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 160–167. [Google Scholar]
- Kalchbrenner, N.; Grefenstette, E.; Blunsom, P. A convolutional neural network for modelling sentences. arXiv 2014, arXiv:1404.2188. [Google Scholar]
- Kalchbrenner, N.; Espeholt, L.; Simonyan, K.; Oord, A.v.d.; Graves, A.; Kavukcuoglu, K. Neural machine translation in linear time. arXiv 2016, arXiv:1610.10099. [Google Scholar]
- Gan, Z.; Pu, Y.; Henao, R.; Li, C.; He, X.; Carin, L. Learning generic sentence representations using convolutional neural networks. arXiv 2016, arXiv:1611.07897. [Google Scholar]
- Zaremba, W.; Sutskever, I.; Vinyals, O. Recurrent neural network regularization. arXiv 2014, arXiv:1409.2329. [Google Scholar]
- Chung, J.; Cho, K.; Bengio, Y. A character-level decoder without explicit segmentation for neural machine translation. arXiv 2016, arXiv:1603.06147. [Google Scholar]
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
- Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 1480–1489. [Google Scholar]
- Kim, Y.; Denton, C.; Hoang, L.; Rush, A.M. Structured attention networks. arXiv 2017, arXiv:1702.00887. [Google Scholar]
- Lin, Z.; Feng, M.; Santos, C.N.d.; Yu, M.; Xiang, B.; Zhou, B.; Bengio, Y. A structured self-attentive sentence embedding. arXiv 2017, arXiv:1703.03130. [Google Scholar]
- Shen, T.; Zhou, T.; Long, G.; Jiang, J.; Pan, S.; Zhang, C. Disan: Directional self-attention network for rnn/cnn-free language understanding. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
- Shen, T.; Zhou, T.; Long, G.; Jiang, J.; Zhang, C. Bi-directional block self-attention for fast and memory-efficient sequence modeling. arXiv 2018, arXiv:1804.00857. [Google Scholar]
- Shen, T.; Zhou, T.; Long, G.; Jiang, J.; Wang, S.; Zhang, C. Reinforced self-attention network: A hybrid of hard and soft attention for sequence modeling. arXiv 2018, arXiv:1801.10296. [Google Scholar]
- Liu, Y.; Lapata, M. Learning structured text representations. Trans. Assoc. Comput. Linguist. 2018, 6, 63–75. [Google Scholar] [CrossRef]
- Lample, G.; Conneau, A. Cross-lingual language model pretraining. arXiv 2019, arXiv:1901.07291. [Google Scholar]
- Guo, Q.; Qiu, X.; Liu, P.; Shao, Y.; Xue, X.; Zhang, Z. Star-transformer. arXiv 2019, arXiv:1902.09113. [Google Scholar]
- Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J.; Le, Q.V.; Salakhutdinov, R. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv 2019, arXiv:1901.02860. [Google Scholar]
- Song, K.; Tan, X.; Qin, T.; Lu, J.; Liu, T.Y. Mass: Masked sequence to sequence pre-training for language generation. arXiv 2019, arXiv:1905.02450. [Google Scholar]
- Reimers, N.; Gurevych, I. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv 2019, arXiv:1908.10084. [Google Scholar]
- Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. Albert: A lite bert for self-supervised learning of language representations. arXiv 2019, arXiv:1909.11942. [Google Scholar]
- Joshi, M.; Chen, D.; Liu, Y.; Weld, D.S.; Zettlemoyer, L.; Levy, O. Spanbert: Improving pre-training by representing and predicting spans. Trans. Assoc. Comput. Linguist. 2020, 8, 64–77. [Google Scholar] [CrossRef]
- Clark, K.; Luong, M.T.; Le, Q.V.; Manning, C.D. Electra: Pre-training text encoders as discriminators rather than generators. arXiv 2020, arXiv:2003.10555. [Google Scholar]
- Conneau, A.; Lample, G.; Ranzato, M.; Denoyer, L.; Jégou, H. Word translation without parallel data. arXiv 2017, arXiv:1710.04087. [Google Scholar]
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
Model | Input | Repres. | Supervision |
---|---|---|---|
Cho et al. [63] | word | sentence+ | supervised |
Seq2Seq [5] | word | sentence+ | supervised |
Skip-Thoughts [64] | word | sentence+ | unsupervised |
RCNN [65] | word | sentence+ | supervised |
FastSent [66] | word | sentence+ | unsupervised |
CoVe [67] | word | word | supervised |
Akbik et al. [42] | character | word | unsupervised |
ELMo [41] | character | word | unsupervised |
Subramanian et al. [68] | word | sentence+ | semi-supervised |
LASER [69] | BPE | sentence+ | supervised |
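As a minimal sketch (not the code of any surveyed model), the word → sentence+ mapping shown for the recurrent models above can be realized by embedding tokens and encoding them with an LSTM, whose cell is the one depicted in Figure 3; PyTorch is our choice here purely for illustration.

```python
import torch
import torch.nn as nn

class RecurrentEncoder(nn.Module):
    """Embeds word-level input and encodes it with an LSTM; the final
    hidden state serves as the sentence+ representation."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        x = self.embed(token_ids)          # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)         # h_n: (1, batch, hidden_dim)
        return h_n.squeeze(0)              # (batch, hidden_dim)

# Usage:
# encoder = RecurrentEncoder(vocab_size=10000)
# sentence_vec = encoder(torch.randint(0, 10000, (2, 12)))
```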
Model | Supervision | Tree |
---|---|---|
RAE [72] | un/supervised | latent |
MV-RNN [73] | un/supervised | given |
AdaSent [74] | supervised | latent |
Tree-LSTM [75] | supervised | given |
RL-SPINN [76] | reinforcement | latent |
ST-Gumbel [77] | unsupervised | latent |
DIORA [78] | unsupervised | latent |
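The recursive models above compose vectors bottom-up over a parse tree, whether the tree is given or latent (the Tree column). A minimal sketch of the classic composition p = tanh(W [c_left; c_right] + b) over a binary tree, matching Figure 4; the random weights here are placeholders that a model such as RAE [72] would instead train.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
W = rng.normal(scale=0.1, size=(d, 2 * d))   # shared composition weights
b = np.zeros(d)

def compose(left, right):
    """Combine two child vectors into a parent vector:
    p = tanh(W [c_left; c_right] + b)."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

def encode(tree, embeddings):
    """tree: either a token string (leaf) or a (left, right) pair of subtrees.
    The root's vector represents the entire input text (cf. Figure 4)."""
    if isinstance(tree, str):
        return embeddings[tree]
    left, right = tree
    return compose(encode(left, embeddings), encode(right, embeddings))
```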
Model | Input | Repres. | Supervision |
---|---|---|---|
C&W [81] | word | word | semi-supervised |
DCNN [82] | word | sentence+ | supervised |
CharCNN [48] | character | word | unsupervised |
ByteNet [83] | character | sentence+ | un/supervised |
CNN-LSTM [84] | word | sentence+ | unsupervised |
CDWE [43] | word | word | supervised |
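In the spirit of the convolutional architecture of Figure 5, a minimal sketch (with illustrative hyperparameters, not taken from any surveyed model) of convolving multiple filters over embedded words and max-pooling over time to obtain a sentence+ representation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvEncoder(nn.Module):
    """Each Conv1d filter slides its kernel over the embedded sequence and
    learns to recognize a different pattern (cf. Figure 5); max-over-time
    pooling yields a fixed-size sentence+ representation."""
    def __init__(self, vocab_size, embed_dim=128, n_filters=100, widths=(3, 4, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, n_filters, w) for w in widths)

    def forward(self, token_ids):                  # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(feats, dim=1)             # (batch, n_filters * len(widths))
```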
Model | Input | Supervision |
---|---|---|
HAN [88] | word | supervised |
Kim et al. [89] | word | unsupervised |
Lin et al. [90] | word | supervised |
Transformer [51] | BPE | semi-supervised |
GPT [44] | BPE | semi-supervised |
DiSAN [91] | word | supervised |
Bi-BloSAN [92] | word | supervised |
ReSAN [93] | word | supervised |
BERT [54] | WP | unsupervised |
Liu et al. [94] | word | supervised |
GPT-2 [3] | BPE | unsupervised |
XLM [95] | BPE | un/supervised |
Star-Transformer [96] | word | supervised |
Transformer-XL [97] | char. or word | unsupervised |
MASS [98] | BPE | unsupervised |
SBERT [99] | word | un/supervised |
XLNet [45] | SP | unsupervised |
ALBERT [100] | WP | unsupervised |
SpanBERT [101] | WP, word, span | unsupervised |
REALM [7] | WP | semi-supervised |
ELECTRA [102] | WP | unsupervised |
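The operation shared by most models in the table above is the scaled dot-product attention of the Transformer [51], Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V. A minimal single-head sketch follows; the projection matrices Wq, Wk, and Wv are the per-head parameters that give each head its own kind of word relations (cf. Figure 6).

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single self-attention head over a sequence X of shape (n, d_model),
    following Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (n, n) word-word relations
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # contextualized outputs
```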
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).