In this section, we introduce how Neural Networks represent and manipulate text; in particular, methods to map text from a discrete orthogonal space to a dense compact space, and probabilistic models applied to language.
2.2.1 Vector Semantics and Embeddings.
Before going into the details of vector semantics, we start with some definitions. We call vocabulary \(\mathcal {V}\) a set of character sequences; a word type (or word) w is a unique entry in a vocabulary, while a token represents a word instance in some text. Often, words are inflected forms of the same base form, which is called lemma (e.g., the words “runs”, “ran”, and “running” are inflected forms of the lemma “run”). Embedding techniques can be applied to words, tokens, or lemmas, to transform them into continuous-valued vectors: the embeddings.
Due to the vocabulary size, Neural Network models tend to grow large in the number of parameters. However, an advantage of Seq2Seq models, as we shall see, is that, since they build and transform their embedding representation from an entire text sequence –rather than from single words– they can leverage sub-word tokenisation to encode the input text, reducing the number of symbols (and thus the parameters needed to embed such symbols). With this sub-word approach, the vocabulary contains frequent sub-word sequences: for example, a word like “unbelievable” may be represented by sub-word units such as “un”, “believ”, and “able”. Words are thus decomposed into smaller units (down to the single-character level) that are the actual constituents of the vocabulary, which also allows handling out-of-vocabulary words. Usually, these sub-word units are extracted from data by applying dictionary-based compression algorithms like Byte-Pair Encoding (BPE) [156].
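To make the merge procedure concrete, the following Python sketch reproduces the core loop of BPE vocabulary learning on a toy corpus; the word frequencies and the number of merges are illustrative assumptions, not values taken from [156].

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the given symbol pair into a single new symbol."""
    new_vocab = {}
    for word, freq in vocab.items():
        symbols, merged, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_vocab[" ".join(merged)] = freq
    return new_vocab

# Toy corpus: words as space-separated character sequences, with their frequencies.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(10):  # the number of merges controls the size of the sub-word vocabulary
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    vocab = merge_pair(max(pairs, key=pairs.get), vocab)
print(vocab)  # frequent character sequences have been merged into sub-word units
```

Each merge adds one sub-word unit to the vocabulary, so the number of merges directly trades off vocabulary size against sequence length.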
Human language is encoded by means of orthogonal symbols (alphabetic characters, ideograms, diacritics, etc.), which form sequences that group at various granularity levels. We call such groups
words,
sentences,
sections, and so on. Several
discrete representations exist to encode such sequences. For example, the popular
one-hot encoding transforms a word into a vector \(\mathbf {o} \in \lbrace 0, 1 \rbrace ^{|\mathcal {V}|}\), such that \(\Vert \mathbf {o}\Vert _2 = 1\). In particular, all the elements are zero except the one corresponding to the word to be encoded, which is set to 1. Note that \(\mathcal {V}\) is usually a large set (with millions of elements).
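As a minimal illustration, the sketch below builds such one-hot vectors over a toy five-word vocabulary (the vocabulary is an illustrative assumption; real ones are orders of magnitude larger).

```python
import numpy as np

vocabulary = ["the", "cat", "sat", "on", "mat"]            # toy vocabulary, |V| = 5
word_to_index = {w: i for i, w in enumerate(vocabulary)}

def one_hot(word):
    o = np.zeros(len(vocabulary))       # all elements are zero ...
    o[word_to_index[word]] = 1.0        # ... except the one of the encoded word
    return o

o = one_hot("cat")
print(o, np.linalg.norm(o))             # [0. 1. 0. 0. 0.] 1.0 (unit L2 norm)
```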
Deep learning models based on Neural Networks, however, work better on
dense representations expressed as
tensors,
and referred to as
vector semantics. In fact, through vector semantics, it is possible to project human language symbols and sequences into dense, smooth and compressed representations. Thus, the key idea of deep learning models for NLP is to project everything into a continuous
d-dimensional space (where
\(d \ll |\mathcal {V}|\)) and then manipulate such representation. For example, a sequence of tokens
\(X_{sparse}\) is converted into a sequence of vectors
\(X_{dense}\):
\[
X_{sparse} = \langle x_1, \ldots , x_t, \ldots , x_{|X|} \rangle \;\longmapsto \; X_{dense} = \langle \mathbf {x}_1, \ldots , \mathbf {x}_t, \ldots , \mathbf {x}_{|X|} \rangle ,
\]
where
\(|X_{sparse}| = |X_{dense}| = |X|\),
\(x_t \in \mathcal {V}\) and
\(\mathbf {x}_t \in \mathbb {R}^d\). This sequence can be further converted into a matrix
\(\mathbf {X}\) or a tensor to be processed by a
Seq2Seq model.
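A minimal sketch of this sparse-to-dense conversion, assuming a PyTorch nn.Embedding lookup table and purely illustrative values for the vocabulary size, d, and the token indices:

```python
import torch
import torch.nn as nn

vocab_size, d = 10_000, 64                    # d << |V|
embedding = nn.Embedding(vocab_size, d)       # learnable lookup table with one row per token

x_sparse = torch.tensor([[12, 7, 431, 9]])    # token indices, shape (batch, |X|)
x_dense = embedding(x_sparse)                 # dense sequence, shape (batch, |X|, d)
print(x_dense.shape)                          # torch.Size([1, 4, 64])
```

The resulting matrix (one d-dimensional row per token) is exactly the kind of tensor fed to a Seq2Seq model.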
A crucial property characterises vectors in this space: they represent the semantic (and, sometimes, syntactic) meaning of the pieces of text they encode [
80]. Figure
14 shows some examples of word encodings. Thus, it is possible to compute the
semantic similarity among pieces of text by computing the
distance of their corresponding vectors. These semantic vector representations are called
embeddings, but are in practice feature vectors.
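In practice, semantic similarity is often measured with the cosine similarity of the corresponding embeddings; the sketch below uses random vectors as stand-ins for real word or sentence embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (1 = same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.random.randn(64)          # embedding of one piece of text (illustrative stand-in)
v = np.random.randn(64)          # embedding of another piece of text (illustrative stand-in)
print(cosine_similarity(u, v))   # close to 1 for semantically similar texts
```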
In recent years, various approaches have emerged to encode word embeddings, using them as a basic “building block” for models representing more complex, higher-level structures, such as sentences, sections, and even whole documents.
Word embeddings. As introduced above, models for word embeddings encode words into a semantic space, where they are represented as d-dimensional vectors. These models can be grouped according to two orthogonal criteria: count-based vs. prediction-based models, and shallow (and thus static) vs. deep (and thus contextual) models.
Shallow models represent the oldest embedding approach [
36,
37,
38,
43,
44,
52]. They are encoded in an embedding matrix
\(\mathbf {W} \in \mathbb {R}^{|\mathcal {V}| \times d}\), where
\(\mathcal {V}\) is the vocabulary and
d is the desired embedding space dimension. The target word’s sparse (one-hot) representation
\(\mathbf {o}\) is used to fetch the word embedding
\(\mathbf {u} \in \mathbb {R}^d\) from the embedding matrix
\(\mathbf {W}\), as:
\(\mathbf {u} = \mathbf {W}^\top \cdot \mathbf {o}\). Notice that the one-hot encoding and the multiplication shown in the previous equation are actually implemented by fetching the word embedding (i.e., a row) from the matrix using the index of the target word.
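The following sketch, with illustrative sizes, verifies that the product \(\mathbf {u} = \mathbf {W}^\top \cdot \mathbf {o}\) and the row lookup indeed yield the same embedding.

```python
import numpy as np

V, d = 1000, 50                       # illustrative vocabulary size and embedding dimension
W = np.random.randn(V, d)             # embedding matrix, one row per word

idx = 42                              # index of the target word in the vocabulary
o = np.zeros(V)
o[idx] = 1.0                          # one-hot representation of the target word

u_matmul = W.T @ o                    # formal definition: u = W^T · o
u_lookup = W[idx]                     # actual implementation: fetch row `idx`
assert np.allclose(u_matmul, u_lookup)
```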
In particular, prediction-based models are trained to predict a target word given a context window of surrounding words in the corpus samples (Continuous Bag-of-Words, CBoW, approach), or to predict the surrounding context words given the target word (skip-gram approach); examples are
Word2Vec [
112,
113] and
fastText [
19]. Instead, count-based models are trained using word co-occurrence counts in the corpus [
12]; see, for example,
GloVe [
126].
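As a hedged usage example, shallow prediction-based embeddings of this kind can be trained with the gensim implementation of Word2Vec; the sketch below assumes the gensim 4.x parameter names (vector_size, sg) and a toy corpus.

```python
from gensim.models import Word2Vec

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]

# sg=1 selects the skip-gram objective (predict context words from the target word);
# sg=0 selects CBoW (predict the target word from its context window).
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

vector = model.wv["cat"]               # the 50-dimensional embedding of "cat"
print(model.wv.most_similar("cat"))    # nearest words in the embedding space
```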
Deep contextual models have been around for some time [
13]. However, they gained traction recently, due to the availability of sufficient computational power to train them on large corpora, in a reasonable amount of time. The idea behind such models is to leverage all the elements in the input word sequence to build a sequence of
hidden, compact, vector representations useful to predict the next unknown word (or generic missing words). Hidden representations extracted by these models encapsulate information on both the corresponding input token and all the other tokens of the sequence. Due to this property, we talk of contextual/contextualised embeddings: the entire sequence serves as context to encode all tokens, and this is what gives deep models an advantage over shallow ones.
Contextual models are based on DNNs and, since they are trained to predict the word sequence probability distribution, represent a typology of (probabilistic)
Language Models (
LMs). Thus they fall into the group of predictive models. We refer to Section
2.2.2 for further details on probabilistic language models.
Early deep contextual models were implemented using unidirectional recurrent Neural Networks [
14,
16].
ELMo [
127], instead, was the first example of bi-directional recurrent networks applied to this problem. Nowadays, these models are built using state-of-the-art transformer networks [
101];
GPT [
134] and
BERT [
46] are examples of transformer-based language models. Note that, independently of the implementation of the hidden layers, all deep models start from an initial shallow embedding of each word in the sequence. The goal of the hidden layers is thus to refine these initial vectors, generating better, more semantically informative embeddings by incorporating information from the other tokens in the input (context) sequence.
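As a hedged example, contextual embeddings can be extracted from a pre-trained transformer language model with the Hugging Face transformers library; the choice of the bert-base-uncased checkpoint is an illustrative assumption.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The bank raised the interest rate.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One d-dimensional vector per input token, each conditioned on the whole sequence.
contextual_embeddings = outputs.last_hidden_state   # shape (1, num_tokens, hidden_size)
print(contextual_embeddings.shape)
```

The same token (e.g., “bank”) receives different vectors in different sentences, which is precisely the contextual property discussed above.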
Generalised embeddings. Besides word-level embeddings, other embeddings are employed in NLP. These generalised embeddings try to encode information from longer pieces of text (e.g., sentences, paragraphs, documents, ...) into single vectors. Although deep contextual approaches for word-level embeddings represent the most adopted solution (due to their performance), generalised embeddings still represent a useful tool, as they are simple, fast, and –for several NLP tasks– provide good-enough embeddings.
Sentence embeddings represent the most adopted typology of generalised embeddings. They find applications in many fields, like document retrieval, and allow for very compact meaningful representations. Sentence embeddings are divided into two groups:
parametrised and
non-parametrised models. Parametrised models must be trained either through supervised approaches –leveraging corpora for
Semantic Textual Similarity (
STS)
or
Natural Language Inference (
NLI)
tasks [
142]– or unsupervised approaches, leveraging generic corpora for language modelling [
85,
121]. Instead, non-parametrised models are built on top of word-level embedding models, and thus training is not required [
5,
219].
Parametrised models are similar to word embeddings, and can be either shallow or built on top of deep language models. To train a supervised model of this kind, a labelled corpus on STS or NLI is needed.
Sentence-BERT [
142] is a popular example of these models. Instead, it is sufficient to leverage a generic, unlabelled corpus to train an unsupervised sentence embedding model. Models like
Sent2Vec [
121] and
Skip-Thought [
85], which follow a self-supervised approach, are examples of models that can exploit unlabelled corpora. They are trained to predict the missing words in a sentence or the following sentence (word by word), respectively.
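As a hedged usage example, Sentence-BERT-style embeddings can be computed with the sentence-transformers library; the checkpoint name below is an illustrative assumption.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # a pre-trained Sentence-BERT-style encoder
sentences = ["A man is playing a guitar.", "Someone plays an instrument."]

embeddings = model.encode(sentences)              # one fixed-size vector per sentence
print(util.cos_sim(embeddings[0], embeddings[1])) # semantic textual similarity score
```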
Non-parametrised models showed that it is possible to achieve meaningful representations simply by combining existing word embeddings. Models like
SIF [
5] or
DynaMax [
219] build their sentence representation starting from the sequence of word embeddings constituting the sentence to encode, and then apply a
weighted average pooling layer or a
max pooling layer, respectively. Although non-parametrised models do not achieve the results of parametrised ones, they are easy to implement and require little computational resources.
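The sketch below illustrates the general non-parametrised recipe with plain average and max pooling over pre-computed word vectors; it is a simplification, since SIF [5] additionally re-weights words by their frequency and DynaMax [219] uses a fuzzy generalisation of max pooling.

```python
import numpy as np

# Stand-in for the pre-computed word embeddings of a 6-token sentence (d = 50).
word_embeddings = np.random.randn(6, 50)

sentence_avg = word_embeddings.mean(axis=0)    # (weighted) average pooling, as in SIF-like models
sentence_max = word_embeddings.max(axis=0)     # max pooling, as in DynaMax-like models
print(sentence_avg.shape, sentence_max.shape)  # (50,) (50,)
```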
Apart from sentence embeddings, other high-level embedding models include documents,
knowledge graphs,
and even
speaker persona in conversations. These can be employed in many NLP applications, like conversational agents.
2.2.2 Probabilistic Language Models.
Probabilistic language models, or simply LMs, are
probability distributions over sequences of words \(P_{LM}(w_1, \ldots , w_i, \ldots , w_n)\) (with
\(w_i \in \mathcal {V}\)) and represent a core tool for NLP [
80].
Seq2Seq Neural Networks can be used to learn probabilistic language models: we can train a deep Neural Network to output the probability of a sequence of tokens as the product of the (conditioned) probabilities of the individual tokens in the sequence. Recent research showed that training
neural language models (i.e., deep Neural Networks trained as language models) on large amounts of text data allows us to: (i) generate high-quality text; (ii) yield very informative features (in the form of contextual embeddings) to be used for discriminative tasks; and (iii) later fine-tune with state-of-the-art results on downstream (generative or discriminative) tasks [
22,
46,
137]. In general, these networks are trained to minimise the negative
\(\log\)-likelihood of the output sequence
\(P_{LM}(w_1, \ldots , w_i, \ldots , w_n;\vartheta)\).
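A minimal sketch of this objective, assuming PyTorch and random logits standing in for the network output: the negative \(\log\)-likelihood of the sequence is the cross-entropy between the per-position output distributions and the observed tokens.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 10_000, 8
logits = torch.randn(1, seq_len, vocab_size)           # model outputs: one distribution per position
targets = torch.randint(0, vocab_size, (1, seq_len))   # observed token indices

# Mean negative log-likelihood over the sequence: -(1/n) * sum_i log P(w_i | ...)
nll = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
print(nll.item())
```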
Approaches. Neural Networks can be used to learn and approximate different language modelling approaches:
causal,
bi-directional, and
transducer (see Figure
15). The approach to language modelling is a result of how the hidden transformation is computed. However, independently of this choice, the end-to-end behaviour of yielding a probability distribution is unchanged.
We talk of
causal language models or
auto-regressive language models or
decoder (only) language models when the LM computes the probability of observing each token in a sequence
\(X=\langle x_1, \ldots , x_i, \ldots , x_{|X|} \rangle \in \mathcal {V}^{|X|}\) given only the preceding ones (see Figure
15(a)):
\[
P_{LM}(X) = \prod _{i=1}^{|X|} P_{LM}(x_i \mid x_1, \ldots , x_{i-1}).
\]
These models are trained on tasks like
causal language modelling (predict the next token given the preceding ones) [
22,
135].
We talk of
bi-directional language models or
auto-encoder language models or
encoder (only) language models when the LM computes the probability of observing each token in a sequence
\(X=\langle x_1, \ldots , x_i, \ldots , x_{|X|} \rangle \in \mathcal {V}^{|X|}\) given all the tokens present in the sequence; the conditional probability can be computed on a (possibly) corrupted copy \(\widetilde{X}\) of the original sequence (see Figure
15(b)):
\[
P_{LM}(X) = \prod _{i=1}^{|X|} P_{LM}(x_i \mid \widetilde{X}).
\]
These models are trained on
masked language modelling (predict the missing tokens from a corrupted input sequence, similarly to the denoising auto-encoder objective) [
46,
103].
Finally, we talk of
transducer language models or
encoder-decoder language models when the LM outputs the posterior causal probability of a target sequence
\(Y=\langle y_1, \ldots , y_j, \ldots , y_{|Y|} \rangle \in \mathcal {V}^{|Y|}\) given a separate source sequence
\(X=\langle x_1, \ldots , x_i, \ldots , x_{|X|} \rangle \in \mathcal {V}^{|X|}\) (see Figure
15(c)):
\[
P_{LM}(Y \mid X) = \prod _{j=1}^{|Y|} P_{LM}(y_j \mid y_1, \ldots , y_{j-1}, X).
\]
These models are trained on tasks like
prefix language modelling (similar to causal language modelling, but the first elements of the sequence are visible to the model and they are not used to compute the loss),
span replacement (predict the missing sub-sequences of tokens from the source), or
de-shuffling (re-order the input sequence of tokens) [
90,
137].
Note that causal LMs have as output sequence the same input sequence shifted to the left, while bi-directional LMs have as output sequence the input sequence with the same alignment (no shifting in either direction). However, the input to a bi-directional LM can be a corrupted version of the output. We underline this concept in Figure
15(b) using
\(\widetilde{X}\) as input and
X as output. Examples of causal LMs are
GPT [
22,
120,
134,
135],
Bloom [
151],
Gopher [
136],
Chinchilla [
71], and
LaMDA [
183]. Examples of bi-directional LMs are
ELMo [
127],
BERT [
46] or
RoBERTa [
103]. When implemented with Transformer networks, these two approaches to language modelling adopt a causal masking pattern and a fully visible masking pattern, respectively, for their self-attention transformations.
On the contrary, transducer LMs work with two separate and orthogonal sequences (the source and the target sequences, respectively
X and
Y) that are both part of the input (the source is the input of the encoder and the target is the input of the decoder), but only the target sequence, shifted to the left, is part of the output.
BART [
90],
T5 [
137,
208,
209],
T0 [
148], and
FLAN [
194] are all examples of transducer LMs. The shifting of the target is due to the auto-regressive nature of the decoder in the transducer. In fact, when implemented with Transformer networks, a transducer language model can be obtained either by combining an encoder with fully visible attention and a decoder with causal attention, using fully visible cross-attention in the middle, or with a
non-causal decoder [
193].
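The sketch below illustrates the self-attention mask patterns mentioned in the last two paragraphs on a toy sequence: a causal (lower-triangular) mask for causal LMs, a fully visible mask for bi-directional LMs, and a prefix mask for non-causal decoders; the sequence and prefix lengths are illustrative assumptions.

```python
import torch

seq_len, prefix_len = 6, 3   # illustrative values

# Causal mask: position i attends only to positions j <= i (lower-triangular).
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Fully visible mask: every position attends to the whole sequence.
fully_visible = torch.ones(seq_len, seq_len, dtype=torch.bool)

# Prefix (non-causal decoder) mask: fully visible over the first prefix_len
# positions, causal over the remaining ones.
prefix = causal.clone()
prefix[:, :prefix_len] = True

print(causal.int(), fully_visible.int(), prefix.int(), sep="\n\n")
```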
In the context of
Dialogue Language Modelling (
DLM) (i.e., language modelling for dialogue) we consider a dialogue
X under two perspectives: either as a plain sequence of tokens or a sequence of
\(n_X\) utterances, each representing a sequence of tokens on its own:
\[
X = \langle x_1, \ldots , x_t, \ldots , x_{|X|} \rangle = \langle U_1, \ldots , U_i, \ldots , U_{n_X} \rangle ,
\]
where
\[
U_i = \langle x_{i,1}, \ldots , x_{i,j}, \ldots , x_{i,|U_i|} \rangle ,
\]
with
\(x_t, x_{i,j} \in \mathcal {V}\). Note that, given this notation, since the tokens in the plain sequence map bijectively to the tokens in the sequence of utterances, we have that
\(x_{1,1} = x_1\) and
\(x_{n_X, |U_{n_X}|} = x_{|X|}\). From this utterance-level division, we can extract all the available context-response pairs \((C_i, R_i)\),
where
\(U_i \in \mathcal {V}^{|U_i|}\) is a sequence of tokens representing a turn in the dialogue,
\(C_i = \langle U_1, \ldots , U_{i - 1} \rangle = U_{i^{\prime }\lt i}\) is the
context associated to the
ith turn in the dialogue and
\(R_i = U_i\) is the
ith
response (or turn) in the dialogue, with
\[
C_i = \langle c_1, \ldots , c_{|C_i|} \rangle \quad \text{and} \quad R_i = \langle r_1, \ldots , r_{|R_i|} \rangle ,
\]
where
\(r_i, c_j \in \mathcal {V}\). In Section
3 we detail how the aforementioned language modelling approaches are currently adapted for the dialogue task.
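A minimal sketch of the context-response pair extraction described above, on a toy dialogue (the utterances are purely illustrative): for each turn after the first, the context collects all preceding turns and the response is the turn itself.

```python
utterances = [
    "Hi, how can I help you?",
    "I would like to book a table for two.",
    "Sure, at what time?",
    "At 8 pm, please.",
]

# (C_i, R_i) pairs: the context is the list of previous turns, the response is turn i.
context_response_pairs = [
    (utterances[:i], utterances[i]) for i in range(1, len(utterances))
]
for context, response in context_response_pairs:
    print(context, "->", response)
```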
Text processing. All the
Seq2Seq Neural Network models for language modelling share the same high-level architecture, as depicted in Figure
17(a): there is an input embedding layer to encode the sequences and transform them from sparse to dense representations, the hidden transformation layers compute the hidden representation of the sequences, and, finally, the output layer yields the posterior probability of observing a token sequence
\(\langle w_1, \ldots , w_i, \ldots , w_n \rangle\) (with
\(w_i \in \mathcal {V}\)). The input sequence of tokens is extracted as in Figure
16.
For each output step, the
Seq2Seq model outputs a discrete probability distribution. Starting from this distribution, it is possible to apply decoding or sampling to generate text. All the inference uses of these models are visualised in Figure
17.
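As a minimal sketch of these inference uses, the snippet below takes the output distribution of a single step (random logits stand in for the model output) and contrasts greedy decoding with sampling.

```python
import torch

vocab_size = 10_000
logits = torch.randn(vocab_size)                # unnormalised log-likelihoods for one output step
probs = torch.softmax(logits, dim=-1)           # discrete distribution over the vocabulary

greedy_token = torch.argmax(probs).item()       # deterministic decoding: pick the most likely token
sampled_token = torch.multinomial(probs, 1).item()  # stochastic generation: sample from the distribution
print(greedy_token, sampled_token)
```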
Independently of the modelling approach, any
Seq2Seq model computes the output probability of a sequence as
\[
P_{LM}(w_1, \ldots , w_i, \ldots , w_n) = \prod _{i=1}^{n} \mathrm{softmax}\big (\mathbf {W}_{LM}^\top \cdot \mathbf {h}_i\big )_{w_i},
\]
where
\(\mathbf {h}_i \in \mathbb {R}^d\) is the contextual embedding corresponding to position
i of the output (
\(1 \le i \le n\)) computed through the hidden transformations
\(h(\cdot)\) of the
Seq2Seq network (
d is the size of the hidden representation),
\(\mathbf {W}_{LM} \in \mathbb {R}^{d \times |\mathcal {V}|}\) is the linear projection layer to compute the
logits (i.e., the unnormalised
\(\log\)-likelihoods), and
\(\mathrm{softmax}(\cdot)\) is the normalised exponential function. Notice that the
\(\mathrm{softmax}(\cdot)\) outputs a vector of
\(|\mathcal {V}|\) elements, that is, the discrete probability distribution over the possible tokens; we retain the
\(w_i\)-th element to obtain the probability of the token in that position.
The input embedding layer takes care of projecting each token into a continuous vector space (the process is depicted in Figure
19). This representation is then transformed by the hidden layers. In more recent Transformer models, the input includes position embeddings, to take into account positional information [
186].
The output layer is a final linear transformation followed by a
\(\mathrm{softmax}(\cdot)\) activation. This final transformation is highly demanding in terms of computation costs, due to the high dimensional size of the output. In fact, the projection matrix is
\(\mathbf {W}_{LM} \in \mathbb {R}^{d \times |\mathcal {V}|}\), where
d is the dimension of the hidden feature vectors and
\(|\mathcal {V}|\) is potentially large. Before the introduction of sub-word tokenisers [
87,
156,
204], which considerably reduced the value of
\(|\mathcal {V}|\), it was common practice to constrain
\(\mathcal {V}\) to the most frequent tokens [
92], or substitute the
\(\mathrm{softmax}(\cdot)\) activation with its hierarchical variant [
113].
The input layer and the final output layer are linear projections whose dimensions carry the same semantic meaning (both map between the vocabulary and the d-dimensional hidden space). Taking advantage of this aspect, many models rely on
weight tying (or
weight sharing) [
78,
132], using the same parameters for the embedding and output layers. In this way, the number of parameters is considerably reduced.
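A minimal PyTorch sketch of weight tying, with illustrative sizes: the output projection reuses the parameters of the input embedding, saving roughly \(|\mathcal {V}| \times d\) parameters.

```python
import torch.nn as nn

vocab_size, d = 10_000, 512
embedding = nn.Embedding(vocab_size, d)           # input layer: |V| x d lookup table
lm_head = nn.Linear(d, vocab_size, bias=False)    # output layer: d -> |V| logits

lm_head.weight = embedding.weight                 # tie both layers to the same |V| x d tensor
assert lm_head.weight.data_ptr() == embedding.weight.data_ptr()
```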
The hidden transformations are the actual
Seq2Seq Neural Network. The choice of the hidden transformation directly influences the language modelling approach. Unidirectional (forward) recurrent networks and self-attention transformers with a causal attention-mask pattern are used to build causal language models [
22,
135]. Bi-directional recurrent networks and self-attention transformers with fully visible attention-mask patterns are used to build bi-directional language models [
46,
103]. Encoder-decoder recurrent networks, encoder-decoder transformer networks, or non-causal transformer networks with a prefix mask pattern are used to build transducer language models [
90,
137].