In this section, we introduce how Neural Networks represent and manipulate text; in particular, methods to map text from a discrete orthogonal space to a dense compact space, and probabilistic models applied to language.
2.2.1 Vector Semantics and Embeddings.
Before going into the details of vector semantics, we start with some definitions. We call vocabulary \(\mathcal {V}\) a set of character sequences; a word type (or word) w is a unique entry in a vocabulary, while a token represents a word instance in some text. Often, words are inflected forms of the same base form, which is called lemma (e.g., the words “runs”, “ran”, and “running” are inflected forms of the lemma “run”). Embedding techniques can be applied to words, tokens, or lemmas, to transform them into continuous-valued vectors: the embeddings.
Due to the vocabulary size, Neural Network models tend to grow large in the number of parameters. However, an advantage of Seq2Seq models, as we shall see, is that, since they build and transform their embedding representation from an entire text sequence –rather than from single words– they can leverage sub-word tokenisation to encode the input text, reducing the number of symbols (and thus the parameters needed to embed such symbols). With this sub-word approach, the vocabulary contains frequent sub-word sequences: for example, a word like “unbelievable” may be represented by sub-word units such as “un”, “believ”, and “able”. Words are thus decomposed into smaller units (down to the single-character level) that are the actual constituents of the vocabulary, which also allows handling out-of-vocabulary words. Usually, these sub-word units are extracted from data by applying dictionary-based compression algorithms like Byte-Pair Encoding (BPE) [156].
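To make the merge procedure concrete, the following Python sketch reproduces the core loop of BPE vocabulary learning on a toy corpus; the word frequencies and the number of merges are illustrative assumptions, not values taken from [156].

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the given symbol pair into a single new symbol."""
    new_vocab = {}
    for word, freq in vocab.items():
        symbols, merged, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_vocab[" ".join(merged)] = freq
    return new_vocab

# Toy corpus: words as space-separated character sequences, with their frequencies.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(10):  # the number of merges controls the size of the sub-word vocabulary
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    vocab = merge_pair(max(pairs, key=pairs.get), vocab)
print(vocab)  # frequent character sequences have been merged into sub-word units
```

Each merge adds one sub-word unit to the vocabulary, so the number of merges directly trades off vocabulary size against sequence length.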
Human language is encoded by means of orthogonal symbols (alphabetic characters, ideograms, diacritics, etc.), which form sequences that group at various granularity levels. We call such groups
words,
sentences,
sections, and so on. Several
discrete representations exist to encode such sequences. For example, the popular
one-hot encoding transforms a word into a vector \(\mathbf {o} \in \lbrace 0, 1 \rbrace ^{|\mathcal {V}|}\), such that \(\Vert \mathbf {o}\Vert _2 = 1\). In particular, all the elements are zero except the one corresponding to the word to be encoded, which is set to 1. Note that \(\mathcal {V}\) is usually a large set (with millions of elements).
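As a minimal illustration, the sketch below builds such one-hot vectors over a toy five-word vocabulary (the vocabulary is an illustrative assumption; real ones are orders of magnitude larger).

```python
import numpy as np

vocabulary = ["the", "cat", "sat", "on", "mat"]            # toy vocabulary, |V| = 5
word_to_index = {w: i for i, w in enumerate(vocabulary)}

def one_hot(word):
    o = np.zeros(len(vocabulary))       # all elements are zero ...
    o[word_to_index[word]] = 1.0        # ... except the one of the encoded word
    return o

o = one_hot("cat")
print(o, np.linalg.norm(o))             # [0. 1. 0. 0. 0.] 1.0 (unit L2 norm)
```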
Deep learning models based on Neural Networks, however, work better on
dense representations expressed as
tensors,
and referred to as
vector semantics. In fact, through vector semantics, it is possible to project human language symbols and sequences into dense, smooth and compressed representations. Thus, the key idea of deep learning models for NLP is to project everything into a continuous
d-dimensional space (where
\(d \ll |\mathcal {V}|\)) and then manipulate such representation. For example, a sequence of tokens
\(X_{sparse}\) is converted into a sequence of vectors
\(X_{dense}\):
\[
X_{sparse} = \langle x_1, \ldots , x_t, \ldots , x_{|X|} \rangle \;\longmapsto \; X_{dense} = \langle \mathbf {x}_1, \ldots , \mathbf {x}_t, \ldots , \mathbf {x}_{|X|} \rangle ,
\]
where
\(|X_{sparse}| = |X_{dense}| = |X|\),
\(x_t \in \mathcal {V}\) and
\(\mathbf {x}_t \in \mathbb {R}^d\). This sequence can be further converted into a matrix
\(\mathbf {X}\) or a tensor to be processed by a
Seq2Seq model.
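A minimal sketch of this sparse-to-dense conversion, assuming a PyTorch nn.Embedding lookup table and purely illustrative values for the vocabulary size, d, and the token indices:

```python
import torch
import torch.nn as nn

vocab_size, d = 10_000, 64                    # d << |V|
embedding = nn.Embedding(vocab_size, d)       # learnable lookup table with one row per token

x_sparse = torch.tensor([[12, 7, 431, 9]])    # token indices, shape (batch, |X|)
x_dense = embedding(x_sparse)                 # dense sequence, shape (batch, |X|, d)
print(x_dense.shape)                          # torch.Size([1, 4, 64])
```

The resulting matrix (one d-dimensional row per token) is exactly the kind of tensor fed to a Seq2Seq model.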
A crucial property characterises vectors in this space: they represent the semantic (and, sometimes, syntactic) meaning of the pieces of text they encode [
80]. Figure
14 shows some examples of word encodings. Thus, it is possible to compute the
semantic similarity among pieces of text by computing the
distance of their corresponding vectors. These semantic vector representations are called
embeddings, but are in practice feature vectors.
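In practice, semantic similarity is often measured with the cosine similarity of the corresponding embeddings; the sketch below uses random vectors as stand-ins for real word or sentence embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (1 = same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.random.randn(64)          # embedding of one piece of text (illustrative stand-in)
v = np.random.randn(64)          # embedding of another piece of text (illustrative stand-in)
print(cosine_similarity(u, v))   # close to 1 for semantically similar texts
```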
In recent years, various approaches have emerged to encode word embeddings, using them as a basic “building block” for models representing more complex, higher-level structures, such as sentences, sections, and even whole documents.
Word embeddings. As introduced above, models for word embeddings encode words into a semantic space, where they are represented as d-dimensional vectors. These models can be grouped according to two orthogonal criteria: count-based vs. prediction-based models, and shallow (and thus static) vs. deep (and thus contextual) models.
Shallow models represent the oldest embedding approach [
36,
37,
38,
43,
44,
52]. They are encoded in an embedding matrix
\(\mathbf {W} \in \mathbb {R}^{|\mathcal {V}| \times d}\), where
\(\mathcal {V}\) is the vocabulary and
d is the desired embedding space dimension. The target word’s sparse (one-hot) representation
\(\mathbf {o}\) is used to fetch the word embedding
\(\mathbf {u} \in \mathbb {R}^d\) from the embedding matrix
\(\mathbf {W}\), as:
\(\mathbf {u} = \mathbf {W}^\top \cdot \mathbf {o}\). Notice that the one-hot encoding and the multiplication shown in the previous equation are actually implemented by fetching the word embedding (i.e., a row) from the matrix using the index of the target word.
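The following sketch, with illustrative sizes, verifies that the product \(\mathbf {u} = \mathbf {W}^\top \cdot \mathbf {o}\) and the row lookup indeed yield the same embedding.

```python
import numpy as np

V, d = 1000, 50                       # illustrative vocabulary size and embedding dimension
W = np.random.randn(V, d)             # embedding matrix, one row per word

idx = 42                              # index of the target word in the vocabulary
o = np.zeros(V)
o[idx] = 1.0                          # one-hot representation of the target word

u_matmul = W.T @ o                    # formal definition: u = W^T · o
u_lookup = W[idx]                     # actual implementation: fetch row `idx`
assert np.allclose(u_matmul, u_lookup)
```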
In particular, prediction-based models are trained to predict a target word given a context window of surrounding words in the corpus samples (Continuous Bag-of-Words, CBoW, approach), or to predict the surrounding context words given the target word (skip-gram approach); examples are
Word2Vec [
112,
113] and
fastText [
19]. Instead, count-based models are trained using word co-occurrence counts in the corpus [
12]; see, for example,
GloVe [
126].
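As a hedged usage example, shallow prediction-based embeddings of this kind can be trained with the gensim implementation of Word2Vec; the sketch below assumes the gensim 4.x parameter names (vector_size, sg) and a toy corpus.

```python
from gensim.models import Word2Vec

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]

# sg=1 selects the skip-gram objective (predict context words from the target word);
# sg=0 selects CBoW (predict the target word from its context window).
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

vector = model.wv["cat"]               # the 50-dimensional embedding of "cat"
print(model.wv.most_similar("cat"))    # nearest words in the embedding space
```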
Deep contextual models have been around for some time [
13]. However, they gained traction recently, due to the availability of sufficient computational power to train them on large corpora, in a reasonable amount of time. The idea behind such models is to leverage all the elements in the input word sequence to build a sequence of
hidden, compact, vector representations useful to predict the next unknown word (or generic missing words). Hidden representations extracted by these models encapsulate information on both the corresponding input token and all the other tokens of the sequence. Due to this property, we talk of contextual/contextualised embeddings: the entire sequence serves as context to encode all tokens, and this is what gives deep models an advantage over shallow ones.
Contextual models are based on DNNs and, since they are trained to predict the word sequence probability distribution, represent a typology of (probabilistic)
Language Models (
LMs). Thus they fall into the group of predictive models. We refer to Section
2.2.2 for further details on probabilistic language models.
Early deep contextual models were implemented using unidirectional recurrent Neural Networks [
14,
16].
ELMo [
127], instead, was the first example of bi-directional recurrent networks applied to this problem. Nowadays, these models are built using state-of-the-art transformer networks [
101];
GPT [
134] and
BERT [
46] are examples of transformer-based language models. Note that, independently of the implementation of the hidden layers, all deep models start from an initial shallow embedding of each word in the sequence. The goal of the hidden layers is thus to refine these initial vectors, generating better, more semantically informative embeddings by incorporating information from the other tokens in the input (context) sequence.
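As a hedged example, contextual embeddings can be extracted from a pre-trained transformer language model with the Hugging Face transformers library; the choice of the bert-base-uncased checkpoint is an illustrative assumption.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The bank raised the interest rate.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One d-dimensional vector per input token, each conditioned on the whole sequence.
contextual_embeddings = outputs.last_hidden_state   # shape (1, num_tokens, hidden_size)
print(contextual_embeddings.shape)
```

The same token (e.g., “bank”) receives different vectors in different sentences, which is precisely the contextual property discussed above.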
Generalised embeddings. Besides word-level embeddings, other embeddings are employed in NLP. These generalised embeddings try to encode information from longer pieces of text (e.g., sentences, paragraphs, documents, ...) into single vectors. Although deep contextual approaches for word-level embeddings represent the most adopted solution (due to their performance), generalised embeddings still represent a useful tool, as they are simple, fast, and –for several NLP tasks– provide good-enough embeddings.
Sentence embeddings represent the most adopted typology of generalised embeddings. They find applications in many fields, like document retrieval, and allow for very compact meaningful representations. Sentence embeddings are divided into two groups:
parametrised and
non-parametrised models. Parametrised models must be trained either through supervised approaches –leveraging corpora for
Semantic Textual Similarity (
STS)
or
Natural Language Inference (
NLI)
tasks [
142]– or unsupervised approaches, leveraging generic corpora for language modelling [
85,
121]. Instead, non-parametrised models are built on top of word-level embedding models, and thus training is not required [
5,
219].
Parametrised models are similar to word embeddings, and can be either shallow or built on top of deep language models. To train a supervised model of this kind, a labelled corpus on STS or NLI is needed.
Sentence-BERT [
142] is a popular example of these models. Instead, it is sufficient to leverage a generic, unlabelled corpus to train an unsupervised sentence embedding model. Models like
Sent2Vec [
121] and
Skip-Thought [
85], which follow a self-supervised approach, are examples of models that can exploit unlabelled corpora. They are trained to predict the missing words in a sentence or the following sentence (word by word), respectively.
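As a hedged usage example, Sentence-BERT-style embeddings can be computed with the sentence-transformers library; the checkpoint name below is an illustrative assumption.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # a pre-trained Sentence-BERT-style encoder
sentences = ["A man is playing a guitar.", "Someone plays an instrument."]

embeddings = model.encode(sentences)              # one fixed-size vector per sentence
print(util.cos_sim(embeddings[0], embeddings[1])) # semantic textual similarity score
```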
Non-parametrised models showed that it is possible to achieve meaningful representations simply by combining existing word embeddings. Models like
SIF [
5] or
DynaMax [
219] build their sentence representation starting from the sequence of word embeddings constituting the sentence to encode, and then apply a
weighted average pooling layer or a
max pooling layer, respectively. Although non-parametrised models do not achieve the results of parametrised ones, they are easy to implement and require little computational resources.
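The sketch below illustrates the general non-parametrised recipe with plain average and max pooling over pre-computed word vectors; it is a simplification, since SIF [5] additionally re-weights words by their frequency and DynaMax [219] uses a fuzzy generalisation of max pooling.

```python
import numpy as np

# Stand-in for the pre-computed word embeddings of a 6-token sentence (d = 50).
word_embeddings = np.random.randn(6, 50)

sentence_avg = word_embeddings.mean(axis=0)    # (weighted) average pooling, as in SIF-like models
sentence_max = word_embeddings.max(axis=0)     # max pooling, as in DynaMax-like models
print(sentence_avg.shape, sentence_max.shape)  # (50,) (50,)
```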
Apart from sentence embeddings, other high-level embedding models include documents,
knowledge graphs,
and even
speaker persona in conversations. These can be employed in many NLP applications, like conversational agents.
2.2.2 Probabilistic Language Models.
Probabilistic language models, or simply LMs, are
probability distributions over sequences of words \(P_{LM}(w_1, \ldots , w_i, \ldots , w_n)\) (with
\(w_i \in \mathcal {V}\)) and represent a core tool for NLP [
80].
Seq2Seq Neural Networks can be used to learn probabilistic language models: we can train a deep Neural Network to output the probability of a sequence of tokens as the product of the (conditioned) probabilities of the individual tokens in the sequence. Recent research showed that training
neural language models (i.e., deep Neural Networks trained as language models) on large amounts of text data allows us to: (i) generate high-quality text; (ii) yield very informative features (in the form of contextual embeddings) to be used for discriminative tasks; and (iii) later fine-tune with state-of-the-art results on downstream (generative or discriminative) tasks [
22,
46,
137]. In general, these networks are trained to minimise the negative
\(\log\)-likelihood of the output sequence
\(P_{LM}(w_1, \ldots , w_i, \ldots , w_n;\vartheta)\).
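A minimal sketch of this objective, assuming PyTorch and random logits standing in for the network output: the negative \(\log\)-likelihood of the sequence is the cross-entropy between the per-position output distributions and the observed tokens.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 10_000, 8
logits = torch.randn(1, seq_len, vocab_size)           # model outputs: one distribution per position
targets = torch.randint(0, vocab_size, (1, seq_len))   # observed token indices

# Mean negative log-likelihood over the sequence: -(1/n) * sum_i log P(w_i | ...)
nll = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
print(nll.item())
```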
Approaches. Neural Networks can be used to learn and approximate different language modelling approaches:
causal,
bi-directional, and
transducer (see Figure
15). The approach to language modelling is a result of how the hidden transformation is computed. However, independently of this choice, the end-to-end behaviour of yielding a probability distribution is unchanged.
We talk of
causal language models or
auto-regressive language models or
decoder (only) language models when the LM computes the probability of observing each token in a sequence
\(X=\langle x_1, \ldots , x_i, \ldots , x_{|X|} \rangle \in \mathcal {V}^{|X|}\) given only the preceding ones (see Figure
15(a)):
\[
P_{LM}(X) = \prod _{i=1}^{|X|} P_{LM}(x_i \mid x_1, \ldots , x_{i-1}).
\]
These models are trained on tasks like
causal language modelling (predict the next token given the preceding ones) [
22,
135].
We talk of
bi-directional language models or
auto-encoder language models or
encoder (only) language models when the LM computes the probability of observing each token in a sequence
\(X=\langle x_1, \ldots , x_i, \ldots , x_{|X|} \rangle \in \mathcal {V}^{|X|}\) given all the tokens present in the sequence; the conditional probability can be computed on a (possibly) corrupted copy \(\widetilde{X}\) of the original sequence (see Figure
15(b)):
\[
P_{LM}(X) = \prod _{i=1}^{|X|} P_{LM}(x_i \mid \widetilde{X}).
\]
These models are trained on
masked language modelling (predict the missing tokens from a corrupted input sequence, similarly to the denoising auto-encoder objective) [
46,
103].
Finally, we talk of
transducer language models or
encoder-decoder language models when the LM outputs the posterior causal probability of a target sequence
\(Y=\langle y_1, \ldots , y_j, \ldots , y_{|Y|} \rangle \in \mathcal {V}^{|Y|}\) given a separate source sequence
\(X=\langle x_1, \ldots , x_i, \ldots , x_{|X|} \rangle \in \mathcal {V}^{|X|}\) (see Figure
15(c)):
\[
P_{LM}(Y \mid X) = \prod _{j=1}^{|Y|} P_{LM}(y_j \mid y_1, \ldots , y_{j-1}, X).
\]
These models are trained on tasks like
prefix language modelling (similar to causal language modelling, but the first elements of the sequence are visible to the model and they are not used to compute the loss),
span replacement (predict the missing sub-sequences of tokens from the source), or
de-shuffling (re-order the input sequence of tokens) [
90,
137].
Note that causal LMs have as output sequence the same input sequence shifted to the left, while bi-directional LMs have as output sequence the input sequence with the same alignment (no shifting in either direction). However, the input to a bi-directional LM can be a corrupted version of the output. We underline this concept in Figure
15(b) using
\(\widetilde{X}\) as input and
X as output. Examples of causal LMs are
GPT [
22,
120,
134,
135],
Bloom [
151],
Gopher [
136],
Chinchilla [
71], and
LaMDA [
183]. Examples of bi-directional LMs are
ELMo [
127],
BERT [
46] or
RoBERTa [
103]. When implemented with Transformer networks, these two approaches to language modelling adopt a causal masking pattern and a fully visible masking pattern, respectively, for their self-attention transformations.
On the contrary, transducer LMs work with two separate and orthogonal sequences (the source and the target sequences, respectively
X and
Y) that are both part of the input (the source is the input of the encoder and the target is the input of the decoder), but only the target sequence, shifted to the left, is part of the output.
BART [
90],
T5 [
137,
208,
209],
T0 [
148], and
FLAN [
194] are all examples of transducer LMs. The shifting of the target is due to the auto-regressive nature of the decoder in the transducer. In fact, when implemented with Transformer networks, a transducer language model can be obtained either by combining an encoder with fully visible attention and a decoder with causal attention, using fully visible cross-attention in the middle, or with a
non-causal decoder [
193].
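The sketch below illustrates the self-attention mask patterns mentioned in the last two paragraphs on a toy sequence: a causal (lower-triangular) mask for causal LMs, a fully visible mask for bi-directional LMs, and a prefix mask for non-causal decoders; the sequence and prefix lengths are illustrative assumptions.

```python
import torch

seq_len, prefix_len = 6, 3   # illustrative values

# Causal mask: position i attends only to positions j <= i (lower-triangular).
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Fully visible mask: every position attends to the whole sequence.
fully_visible = torch.ones(seq_len, seq_len, dtype=torch.bool)

# Prefix (non-causal decoder) mask: fully visible over the first prefix_len
# positions, causal over the remaining ones.
prefix = causal.clone()
prefix[:, :prefix_len] = True

print(causal.int(), fully_visible.int(), prefix.int(), sep="\n\n")
```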
In the context of
Dialogue Language Modelling (
DLM) (i.e., language modelling for dialogue) we consider a dialogue
X under two perspectives: either as a plain sequence of tokens or a sequence of
\(n_X\) utterances, each representing a sequence of tokens on its own:
\[
X = \langle x_1, \ldots , x_t, \ldots , x_{|X|} \rangle = \langle U_1, \ldots , U_i, \ldots , U_{n_X} \rangle ,
\]
where
\[
U_i = \langle x_{i,1}, \ldots , x_{i,j}, \ldots , x_{i,|U_i|} \rangle ,
\]
with
\(x_t, x_{i,j} \in \mathcal {V}\). Note that, given this notation, since the tokens in the plain sequence map bijectively to the tokens in the sequence of utterances, we have that
\(x_{1,1} = x_1\) and
\(x_{n_X, |U_{n_X}|} = x_{|X|}\). From this utterance-level division, we can extract all the available context-response pairs \((C_i, R_i)\),
where
\(U_i \in \mathcal {V}^{|U_i|}\) is a sequence of tokens representing a turn in the dialogue,
\(C_i = \langle U_1, \ldots , U_{i - 1} \rangle = U_{i^{\prime }\lt i}\) is the
context associated to the
ith turn in the dialogue and
\(R_i = U_i\) is the
ith
response (or turn) in the dialogue, with
\[
C_i = \langle c_1, \ldots , c_{|C_i|} \rangle \quad \text{and} \quad R_i = \langle r_1, \ldots , r_{|R_i|} \rangle ,
\]
where
\(r_i, c_j \in \mathcal {V}\). In Section
3 we detail how the aforementioned language modelling approaches are currently adapted for the dialogue task.
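A minimal sketch of the context-response pair extraction described above, on a toy dialogue (the utterances are purely illustrative): for each turn after the first, the context collects all preceding turns and the response is the turn itself.

```python
utterances = [
    "Hi, how can I help you?",
    "I would like to book a table for two.",
    "Sure, at what time?",
    "At 8 pm, please.",
]

# (C_i, R_i) pairs: the context is the list of previous turns, the response is turn i.
context_response_pairs = [
    (utterances[:i], utterances[i]) for i in range(1, len(utterances))
]
for context, response in context_response_pairs:
    print(context, "->", response)
```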
Text processing. All the
Seq2Seq Neural Network models for language modelling share the same high-level architecture, as depicted in Figure
17(a): there is an input embedding layer to encode the sequences and transform them from sparse to dense representations, the hidden transformation layers compute the hidden representation of the sequences, and, finally, the output layer yields the posterior probability of observing a token sequence
\(\langle w_1, \ldots , w_i, \ldots , w_n \rangle\) (with
\(w_i \in \mathcal {V}\)). The input sequence of tokens is extracted as in Figure
16.
For each output step, the
Seq2Seq model outputs a discrete probability distribution. Starting from this distribution, it is possible to apply decoding or sampling to generate text. All the inference uses of these models are visualised in Figure
17.
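As a minimal sketch of these inference uses, the snippet below takes the output distribution of a single step (random logits stand in for the model output) and contrasts greedy decoding with sampling.

```python
import torch

vocab_size = 10_000
logits = torch.randn(vocab_size)                # unnormalised log-likelihoods for one output step
probs = torch.softmax(logits, dim=-1)           # discrete distribution over the vocabulary

greedy_token = torch.argmax(probs).item()       # deterministic decoding: pick the most likely token
sampled_token = torch.multinomial(probs, 1).item()  # stochastic generation: sample from the distribution
print(greedy_token, sampled_token)
```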
Independently of the modelling approach, any
Seq2Seq model computes the output probability of a sequence as
\[
P_{LM}(w_1, \ldots , w_i, \ldots , w_n) = \prod _{i=1}^{n} \mathrm{softmax}\big (\mathbf {W}_{LM}^\top \cdot \mathbf {h}_i\big )_{w_i},
\]
where
\(\mathbf {h}_i \in \mathbb {R}^d\) is the contextual embedding corresponding to position
i of the output (
\(1 \le i \le n\)) computed through the hidden transformations
\(h(\cdot)\) of the
Seq2Seq network (
d is the size of the hidden representation),
\(\mathbf {W}_{LM} \in \mathbb {R}^{d \times |\mathcal {V}|}\) is the linear projection layer to compute the
logits (i.e., the unnormalised
\(\log\)-likelihoods), and
\(\mathrm{softmax}(\cdot)\) is the normalised exponential function. Notice that the
\(\mathrm{softmax}(\cdot)\) outputs a vector of
\(|\mathcal {V}|\) elements, that is, the discrete probability distribution over the possible tokens; we retain the
\(w_i\)-th element to obtain the probability of the token in that position.
The input embedding layer takes care of projecting each token into a continuous vector space (the process is depicted in Figure
19). This representation is then transformed by the hidden layers. In more recent Transformer models, the input includes position embeddings, to take into account positional information [
186].
The output layer is a final linear transformation followed by a
\(\mathrm{softmax}(\cdot)\) activation. This final transformation is highly demanding in terms of computation costs, due to the high dimensional size of the output. In fact, the projection matrix is
\(\mathbf {W}_{LM} \in \mathbb {R}^{d \times |\mathcal {V}|}\), where
d is the dimension of the hidden feature vectors and
\(|\mathcal {V}|\) is potentially large. Before the introduction of sub-word tokenisers [
87,
156,
204], which considerably reduced the value of
\(|\mathcal {V}|\), it was common practice to constrain
\(\mathcal {V}\) to the most frequent tokens [
92], or substitute the
\(\mathrm{softmax}(\cdot)\) activation with its hierarchical variant [
113].
The input layer and the final output layer are linear projections whose dimensions carry the same semantic meaning (both map between the vocabulary and the d-dimensional hidden space). Taking advantage of this aspect, many models rely on
weight tying (or
weight sharing) [
78,
132], using the same parameters for the embedding and output layers. In this way, the number of parameters is considerably reduced.
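A minimal PyTorch sketch of weight tying, with illustrative sizes: the output projection reuses the parameters of the input embedding, saving roughly \(|\mathcal {V}| \times d\) parameters.

```python
import torch.nn as nn

vocab_size, d = 10_000, 512
embedding = nn.Embedding(vocab_size, d)           # input layer: |V| x d lookup table
lm_head = nn.Linear(d, vocab_size, bias=False)    # output layer: d -> |V| logits

lm_head.weight = embedding.weight                 # tie both layers to the same |V| x d tensor
assert lm_head.weight.data_ptr() == embedding.weight.data_ptr()
```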
The hidden transformations are the actual
Seq2Seq Neural Network. The choice of the hidden transformation directly influences the language modelling approach. Unidirectional (forward) recurrent networks and self-attention transformers with a causal attention-mask pattern are used to build causal language models [
22,
135]. Bi-directional recurrent networks and self-attention transformers with fully visible attention-mask patterns are used to build bi-directional language models [
46,
103]. Encoder-decoder recurrent networks, encoder-decoder transformer networks, or non-causal transformer networks with a prefix mask pattern are used to build transducer language models [
90,
137].