The first category of explanations refers to explaining the predictions generated by LLMs. Let us consider a scenario where we have a language model and we input a specific text into the model. The model then produces a classification output, such as sentiment classification or a prediction for the next token. In this scenario, the role of explanation is to clarify the process by which the model generated the particular classification or token prediction. Since the goal is to explain how the LLM makes the prediction for a specific input, we call it the
local explanation. This category encompasses four main streams of approaches for generating explanations including feature attribution-based explanation, attention-based explanation, example-based explanation, and natural language explanation (see Figure
3).
3.1.1 Feature Attribution-Based Explanation.
Feature attribution methods aim to measure the relevance of each input feature (e.g., words, phrases, and text spans) to a model’s prediction. Given an input text \(\boldsymbol {x}\) comprising n word features \(\lbrace x_1, x_2, \ldots , x_n\rbrace\), a fine-tuned language model f generates an output \(f(\boldsymbol {x})\). Attribution methods assign a relevance score \(R(x_i)\) to each input word feature \(x_i\) to reflect its contribution to the model prediction \(f(\boldsymbol {x})\). The methods that follow this strategy can be mainly categorized into four types: perturbation-based methods, gradient-based methods, surrogate models, and decomposition-based methods.
Perturbation-Based Explanation. Perturbation-based methods work by perturbing input examples, for instance by removing, masking, or altering input features, and evaluating the resulting changes in model output. The most straightforward strategy is
leave-one-out, which perturbs inputs by removing features at various levels including embedding vectors, hidden units [Li et al.
2017], words [Li et al.
2016], tokens and spans [Wu et al.
2020a] to measure feature importance. The basic idea is to remove the minimum set of inputs that changes the model prediction. The set of inputs to remove is selected using a variety of criteria, such as confidence scores, or via reinforcement learning. However, this removal strategy assumes that input features are independent and ignores correlations among them. Additionally, methods based on confidence scores can fail due to the pathological behaviors of overconfident models [Feng et al.
2018]. For example, models can maintain high-confidence predictions even when the reduced inputs are nonsensical. This overconfidence issue can be mitigated via regularization with regular examples, label smoothing, and fine-tuning models’ confidence [Feng et al.
2018]. Besides, current perturbation methods tend to generate
out-of-distribution (
OOD) data. This can be alleviated by constraining the perturbed data to remain close to the original data distribution [Qiu et al.
2021].
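As a concrete illustration, the sketch below computes leave-one-out importance scores for a text classifier; `predict_proba`, a function mapping a list of texts to class-probability arrays, is an assumed stand-in for any fine-tuned language model rather than an API prescribed by the cited works.

```python
# Minimal leave-one-out attribution sketch.
from typing import Callable, List, Tuple
import numpy as np

def leave_one_out(text: str,
                  predict_proba: Callable[[List[str]], np.ndarray],
                  target_class: int) -> List[Tuple[str, float]]:
    tokens = text.split()
    base = predict_proba([text])[0][target_class]        # original confidence
    scores = []
    for i in range(len(tokens)):
        # Remove the i-th token and re-run the model.
        reduced = " ".join(tokens[:i] + tokens[i + 1:])
        prob = predict_proba([reduced])[0][target_class]
        # Importance = drop in confidence when the token is removed.
        scores.append((tokens[i], float(base - prob)))
    return scores
```

Tokens whose removal causes the largest confidence drop are deemed most important; as noted above, this treats features as independent and inherits the overconfidence and OOD issues discussed in this paragraph.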
Gradient-Based Explanation. Gradient-based attribution techniques determine the importance of each input feature by analyzing the partial derivatives of the output with respect to each input dimension. The magnitude of derivatives reflects the sensitivity of the output to changes in the input. The basic formulation of raw gradient methods is described as
\(\boldsymbol {s}_j=\frac{\partial f(\boldsymbol {x})}{\partial \boldsymbol {x}_j},\) where
\(f(\boldsymbol {x})\) is the prediction function of the network and
\(\boldsymbol {x}_j\) denotes the j-th input feature. This scheme has also been improved as gradient
\(\times\) input [Kindermans et al.
2017] and has been used in various explanation tasks, such as computing the token-level attribution score [Mohebbi et al.
2021]. However, vanilla gradient-based methods have some major limitations. First, they do not satisfy the input invariance, meaning that input transformations such as constant shift can generate misleading attributions without affecting the model prediction [Kindermans et al.
2017]. Second, they fail to deal with zero-valued inputs. Third, they suffer from gradient saturation where large gradients dominate and obscure smaller gradients. The difference-from-reference approaches, such as
integrated gradients (
IG), are believed to be a good fit to solve these challenges by satisfying more axioms for attributions [Sundararajan et al.
2017]. The fundamental mechanism of IG and its variants is to accumulate the gradients obtained as the input is interpolated between a reference point and the actual input. The baseline reference point is critical for reliable evaluation, but the criteria for choosing an appropriate baseline remain unclear. Some works use noise or synthetic references derived from the training data, but performance cannot be guaranteed [Lundstrom et al.
2022]. In addition, IG struggles to capture output changes in saturated regions and should focus on unsaturated regions [Miglani et al.
2020]. Another challenge of IG is the computational overhead required to achieve high-quality integrals. Since IG integrates along a straight-line path that does not fit the discrete word embedding space well, variants have been developed to adapt it to language models [Sikdar et al.
2021; Sanyal and Ren
2021; Enguehard
2023].
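As a minimal sketch of integrated gradients over input embeddings, the code below assumes a PyTorch `model` that maps an embedding tensor of shape (1, sequence length, hidden size) directly to class logits; the all-zeros baseline and the number of interpolation steps are illustrative choices, not prescriptions from the cited works.

```python
# Integrated gradients sketch: accumulate gradients along a straight-line
# path from a baseline embedding (here, all zeros) to the actual input.
import torch

def integrated_gradients(model, input_embeds, target_class, steps=50):
    baseline = torch.zeros_like(input_embeds)
    total_grads = torch.zeros_like(input_embeds)
    for k in range(1, steps + 1):
        alpha = k / steps
        point = (baseline + alpha * (input_embeds - baseline)).detach().requires_grad_(True)
        logit = model(point)[0, target_class]          # assumes (batch, num_classes) logits
        grad = torch.autograd.grad(logit, point)[0]
        total_grads += grad
    avg_grads = total_grads / steps
    # Scale averaged gradients by the input-baseline difference and sum over
    # the hidden dimension to obtain one attribution score per token.
    attributions = (input_embeds - baseline) * avg_grads
    return attributions.sum(dim=-1)
```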
Surrogate Models. Surrogate model methods use simpler, more human-comprehensible models to explain individual predictions of black-box models. These surrogate models include decision trees, linear models, decision rules, and other white-box models that are inherently more understandable to humans. The explanation models need to satisfy additivity, meaning that the total impact on the prediction should equal the sum of the individual impacts of each explanatory factor. Also, the choice of interpretable representations matters. Unlike raw features, these representations should be powerful enough to generate explanations yet still understandable and meaningful to human beings. An early representative local explanation method called LIME [Ribeiro et al.
2016] employs this paradigm. To generate explanations for a specific instance, the surrogate model is trained on data sampled locally around that instance to approximate the behavior of the original complex model in the local region. However, it is shown that LIME does not satisfy some properties of additive attribution, such as local accuracy, consistency, and missingness [Lundberg and Lee
2017b]. SHAP is another framework that satisfies the desirable properties of additive attribution methods [Lundberg and Lee
2017b]. It treats features as players in a cooperative prediction game and assigns each subset of features a value reflecting their contribution to the model prediction. Instead of building a local explanation model per instance, SHAP computes Shapley values [Shapley et al.
1953] using the entire dataset. Challenges in applying SHAP include choosing appropriate methods for removing features and efficiently estimating Shapley values. Feature removal can be done by replacing values with baselines such as zeros, means, or samples from a distribution, but it is unclear how to pick the right baseline. Exact estimation of Shapley values is also computationally prohibitive, with complexity exponential in the number of features. Approximation strategies including weighted linear regression, permutation, and other model-specific methods have been adopted [Chen et al.
2023b] to estimate Shapley values. Despite complexity, SHAP remains popular and widely used due to its expressiveness for large deep models. To adapt SHAP to Transformer-based language models, approaches such as TransSHAP have been proposed [Chen et al.
2023b; Kokalj et al.
2021]. TransSHAP mainly focuses on adapting SHAP to sub-word text input and providing sequential visualization explanations that are well suited for understanding how LLMs make predictions.
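The sketch below illustrates the LIME-style local surrogate idea under simplifying assumptions: binary token-presence perturbations, an exponential proximity kernel, and a plain weighted least-squares fit in place of the regularized regression used in the original method; `predict_proba` is again an assumed black-box classifier.

```python
# LIME-style local surrogate sketch: sample perturbations around one
# instance, weight them by proximity, and fit a weighted linear model
# whose coefficients serve as token-level explanations.
import numpy as np

def lime_explain(text, predict_proba, target_class, n_samples=500, rng=None):
    rng = rng or np.random.default_rng(0)
    tokens = text.split()
    d = len(tokens)
    masks = rng.integers(0, 2, size=(n_samples, d))       # 1 = keep token
    masks[0] = 1                                           # include the unperturbed text
    texts = [" ".join(t for t, m in zip(tokens, row) if m) for row in masks]
    y = np.array([predict_proba([t])[0][target_class] for t in texts])
    # Proximity kernel: samples closer to the original text get more weight.
    distance = 1.0 - masks.mean(axis=1)
    weights = np.exp(-(distance ** 2) / 0.25)
    # Weighted least squares on the interpretable (binary) representation.
    X = np.hstack([masks, np.ones((n_samples, 1))])        # add an intercept column
    W = np.sqrt(weights)[:, None]
    coef, *_ = np.linalg.lstsq(X * W, y * W[:, 0], rcond=None)
    return list(zip(tokens, coef[:d]))                     # per-token surrogate weights
```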
Decomposition-Based Methods. Decomposition techniques aim to break down the relevance score into linear contributions from the input. Some works assign relevance scores directly from the final output layer to the input [Du et al.
2019b]. Another line of work attributes relevance scores layer by layer, from the final output layer toward the input.
Layer-wise relevance propagation (
LRP) [Montavon et al.
2019] and
deep Taylor decomposition (
DTD) approaches [Montavon et al.
2015] are two classes of commonly used methods. The general idea is to decompose the relevance score
\(R_j^{(l+1)}\) of neuron
j in layer
\(l+1\) into contributions to each of its input neurons
i in layer
l, which can be formulated as
\(R_j^{(l+1)}=\sum _i R_{i \leftarrow j}^{(l, l+1)}.\) The key difference is in the relevance propagation rules used by LRP versus DTD. These methods can be applied to break down relevance scores into contributions from model components such as attention heads [Voita et al.
2019], tokens, and neuron activations [Voita et al.
2021]. Both methods have been applied to derive the relevance score of inputs in Transformer-based models [Wu and Ong
2021; Chefer et al.
2021].
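As a simplified illustration of relevance propagation, the snippet below applies the widely used epsilon rule to a single linear layer; it sketches one LRP step rather than a full Transformer-specific propagation scheme such as those in the cited works.

```python
# LRP epsilon-rule sketch for one linear layer: redistribute the relevance of
# each output neuron to its inputs in proportion to their contributions
# z_ij = a_i * w_ij, with a small epsilon term for numerical stability.
import numpy as np

def lrp_linear(a, W, b, relevance_out, eps=1e-6):
    """a: (d_in,) input activations; W: (d_in, d_out); b: (d_out,);
    relevance_out: (d_out,) relevance assigned to the layer's outputs."""
    z = a @ W + b                                  # pre-activations z_j
    z = z + eps * np.where(z >= 0, 1.0, -1.0)      # epsilon stabilizer
    s = relevance_out / z                          # relevance per unit of z_j
    # Each input neuron i receives sum_j a_i * w_ij * s_j.
    relevance_in = a * (W @ s)
    return relevance_in                            # (d_in,); approximately conserves relevance
```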
3.1.2 Attention-Based Explanation.
The attention mechanism is often viewed as a way to attend to the most relevant parts of the input. Intuitively, attention may capture meaningful correlations between intermediate states of the input that can explain the model’s predictions. Many existing approaches try to explain models solely based on the attention weights or by analyzing the knowledge encoded in the attention. These explanation techniques can be categorized into three main groups: visualization methods, function-based methods, and probing-based methods. As probing-based techniques are usually employed to learn global explanations, they are discussed in Section
3.2.1. In addition, there is an extensive debate in research on whether attention weights are actually suitable for explanations. This topic will be covered later in the discussion.
Visualizations. Visualizing attention provides an intuitive way to understand how models work by showing attention patterns and statistics. Common techniques involve visualizing the attention heads for a single input using bipartite graphs or heatmaps. These two methods are simply different visual representations of attention, one as a graph and the other as a matrix, as illustrated in Figure
4. Visualization systems differ in their ability to show relationships at multiple scales, representing attention in various forms for different models. At the input data level, attention scores for word/token/sentence pairs between the premise and the hypothesis sentence are shown to evaluate the faithfulness of the model prediction [Vig
2019]. Some systems also allow users to manually modify attention weights to observe effects [Jaunet et al.
2021]. At the neuron level, individual attention heads can be inspected to understand model behaviors [Park et al.
2019; Vig
2019; Hoover et al.
2020; Jaunet et al.
2021]. At the model level, attention across heads and layers is visualized to identify patterns [Park et al.
2019; Vig
2019; Yeh et al.
2023]. One notable work focuses on visualizing attention flow to trace the evolution of attention, which can be used to understand information transformation and enable training stage comparison between models [DeRose et al.
2020]. Thus, attention visualization provides an explicit, interactive way to diagnose bias and errors and to evaluate decision rules. Interestingly, it also facilitates formulating explanatory hypotheses.
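A minimal heatmap-style visualization can be produced directly from the attention tensors returned by a Hugging Face model, as sketched below; the choice of `bert-base-uncased` and of a particular layer and head is illustrative rather than taken from any of the cited systems.

```python
# Attention heatmap sketch: plot one head's token-to-token attention matrix.
import matplotlib.pyplot as plt
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The movie was surprisingly good", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

layer, head = 5, 3                                    # illustrative choice
attn = outputs.attentions[layer][0, head]             # (seq_len, seq_len)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

plt.imshow(attn.numpy(), cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.colorbar(label="attention weight")
plt.title(f"Layer {layer}, head {head}")
plt.tight_layout()
plt.show()
```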
Function-Based Methods. Since raw attention is insufficient to fully explain model predictions, some studies have developed enhanced variants as replacements to identify important attributions for explanation. Gradient is a well-recognized metric for measuring sensitivity and salience, so it is widely incorporated into self-defined attribution scores. These self-designed attribution scores differ in how they define gradients involving attention weights. For example, gradients can be partial derivatives of outputs with respect to attention weights [Barkan et al.
2021] or integrated versions of partial gradients [Hao et al.
2021]. The operations between gradients and attention can also vary, such as element-wise products. Overall, these attribution scores that blend attention and gradients generally perform better than using either alone, as they fuse more information that helps to highlight important features and understand networks.
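The snippet below sketches the general gradient-times-attention recipe on a toy single-head self-attention layer; the specific attribution definitions in the cited works differ (e.g., integrated rather than raw gradients), so this only demonstrates the element-wise combination of attention weights with their gradients.

```python
# Gradient x attention sketch on a toy single-head self-attention layer.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d = 6, 16
x = torch.randn(seq_len, d, requires_grad=True)        # toy token representations
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
w_out = torch.randn(d)                                  # toy scoring head

q, k, v = x @ Wq, x @ Wk, x @ Wv
attn = F.softmax(q @ k.T / d ** 0.5, dim=-1)            # (seq_len, seq_len) attention weights
attn.retain_grad()                                       # keep the gradient of this non-leaf tensor
score = ((attn @ v).mean(dim=0) * w_out).sum()           # scalar "prediction"
score.backward()

# Element-wise product of attention weights and their gradients, summed over
# the query axis to obtain one attribution score per attended token.
attribution = (attn * attn.grad).sum(dim=0)
print(attribution)
```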
Debate Over Attention. There is extensive research evaluating attention heads, but the debate over the validity of this approach is unlikely to be resolved soon. The debate stems from several key aspects. First, some works compare attention-based explanations with those from other methods like LIME. They find that attention often does not identify the most important features for prediction [Serrano and Smith
2019; Jain and Wallace
2019]. Attention weights also provide inferior explanations compared to these alternatives [Thorne et al.
2019] or do not correlate with other explanation methods [Jain and Wallace
2019; Liu et al.
2020; Ethayarajh and Jurafsky
2021]. Second, some directly criticize the usefulness of the attention mechanism in model predictions. They argue that raw attention fails to capture syntactic structures in text and may not contribute to predictions as commonly assumed [Mohankumar et al.
2020]. In addition, raw attention contains redundant information that reduces its reliability in explanations [Bai et al.
2021; Brunner et al.
2019]. However, other studies contradict these claims. For example, evaluating explanation models for consistency can pose challenges across various approaches, not limited to attention alone [Neely et al.
2021]. Besides, manipulation of attention weights without retraining can bias evaluations [Wiegreffe and Pinter
2019]. Furthermore, attention heads in BERT have been shown to encode syntax effectively [Clark et al.
2019]. To make attention explainable, technical solutions have also been explored by optimizing input representation [Mohankumar et al.
2020], regularizing learning objectives [Moradi et al.
2021], avoiding biased learning [Bai et al.
2021], and even incorporating human rationales [Arous et al.
2021]. However, the core reason for the ongoing debates is the lack of well-established evaluation criteria, which will be further discussed in Section
5.1.
3.1.3 Example-Based Explanations.
Example-based explanations aim to explain model behavior from the perspective of individual instances [Koh and Liang
2017]. Unlike model-based or feature-based explanations, example-based explanations illustrate how a model’s output changes with different inputs. We focus on adversarial examples, counterfactual explanations, and data influence. Adversarial examples are generally synthesized by manipulating less important components of the input data. They reveal cases where the model falters or errs, illuminating its weaknesses. In contrast, counterfactual explanations are generated mostly by changing significant parts of the input data; they are popular in scenarios such as algorithmic recourse, as they provide remedies for reaching a desirable outcome. Unlike approaches that manipulate inputs, data influence examines how the training data impacts a model’s predictions on test data.
Adversarial Example. Studies show that neural models are highly susceptible to small changes in the input data. These carefully crafted modifications can alter model decisions while being barely noticeable to humans. Adversarial examples are critical in exposing areas where models fail and are usually added to training data to improve robustness and accuracy. Adversarial examples were initially generated by word-level manipulations such as errors, removal, and insertion, which are obvious upon inspection. More advanced token-level perturbation methods like TextFooler [Jin et al.
2020] have since been developed, which strategically target important words first based on a ranking. A candidate word is then chosen based on word embedding similarity, part-of-speech consistency, sentence semantic similarity, and prediction shift. However, word embeddings are limited for sentence representation compared to contextualized representations, often resulting in incoherent pieces. By focusing on contextualized representations, a range of work adopting the mask-then-infill procedure has achieved state-of-the-art performance [Garg and Ramakrishnan
2020; Li et al.
2021a]. They leverage pre-trained masked language models like BERT for perturbations including replacement, insertion, and merging. Typically, a large corpus is employed to train the masked language models, generate contextualized representations, and obtain token importance. The models are then frozen and perturbation operations are performed on tokens in ranked order. For replacement, the generated token replaces the masked token. For insertion, a new token is inserted to the left or right of the masked token. For merging, a bigram is masked and replaced with a single token. SemAttack [Wang et al.
2022b] proposes a more general and effective framework applicable to various embedding spaces including the typo space, knowledge space, and contextualized semantic space. The input tokens are first transformed into an embedding space to generate perturbed embeddings that are iteratively optimized to meet the attack goals. Experiments show that replacing 5% of words reduces BERT’s accuracy from 70.6% to 2.4% even with defenses in a white-box setting. SemAttack’s outstanding attack performance may be attributed to the fact that it directly manipulates embeddings.
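A bare-bones version of the mask-then-infill procedure is sketched below using a fill-mask model; the victim classifier `predict_proba` and the choice of which token to attack are assumed placeholders, and real attack methods add the semantic-similarity and part-of-speech constraints described above.

```python
# Mask-then-infill adversarial sketch: mask one important token and try
# masked-LM replacements, keeping the one that most reduces the victim
# classifier's confidence in the target class.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def attack_token(text, token_index, predict_proba, target_class, top_k=10):
    tokens = text.split()
    base = predict_proba([text])[0][target_class]
    masked = tokens.copy()
    masked[token_index] = fill_mask.tokenizer.mask_token
    best = (text, base)
    for cand in fill_mask(" ".join(masked), top_k=top_k):
        perturbed = tokens.copy()
        perturbed[token_index] = cand["token_str"]
        prob = predict_proba([" ".join(perturbed)])[0][target_class]
        if prob < best[1]:                     # larger confidence drop is better
            best = (" ".join(perturbed), prob)
    return best                                # most damaging replacement found
```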
Counterfactual Explanation. Counterfactual explanation is a common form of causal explanation, treating the input as the cause of the prediction under Granger causality. Given an observed input
\(\boldsymbol {x}\) and a perturbed
\(\hat{\boldsymbol {x}}\) with certain features changed, the prediction
\(\boldsymbol {y}\) would change to
\(\hat{\boldsymbol {y}}\). Counterfactual explanations reveal what would have happened had certain parts of the input been changed. They are often generated to meet specific needs, such as algorithmic recourse, by selecting particular counterfactuals. Examples can be generated by humans or by perturbation techniques like paraphrasing or word replacement. A representative generator, Polyjuice [Wu et al.
2021], supports multiple perturbation types for input sentences, such as deletion, negation, and shuffling. It can also perturb tokens based on their importance. Polyjuice fine-tunes GPT-2 on pairs of original and perturbed sentences tailored to downstream tasks to provide realistic counterfactuals. It generates more extensive counterfactuals at a median speed of 10 seconds per counterfactual, compared to 2 minutes for previous methods that depend on crowd workers [Kaushik et al.
2020]. Counterfactual explanation generation has been framed as a two-stage approach involving first masking/selecting important tokens and then infilling/editing those tokens [Treviso et al.
2023; Ross et al.
2021]. Specifically, MiCE uses gradient-based attribution to select tokens to mask in the first stage and focuses on optimizing for minimal edits through binary search [Ross et al.
2021]. In contrast, CREST leverages rationales from a selective rationalization model and relaxes this hard minimality constraint of MiCE. Instead, CREST uses the sparsity budget of the rationalizer to control closeness [Treviso et al.
2023]. Experiments show that both methods generate high-quality counterfactuals in terms of validity and fluency.
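Following the two-stage framing above, a minimal mask-then-edit loop can be sketched as follows, reusing a fill-mask pipeline like the one in the adversarial sketch; the attribution-based token ranking and the classifier `predict_label` are assumed components, and methods such as MiCE additionally optimize for minimal edits.

```python
# Two-stage counterfactual sketch: mask tokens in order of importance and
# infill them until the classifier's predicted label flips.
def generate_counterfactual(text, ranked_indices, fill_mask, predict_label,
                            original_label, top_k=5):
    tokens = text.split()
    for idx in ranked_indices:                            # stage 1: select and mask
        masked = tokens.copy()
        masked[idx] = fill_mask.tokenizer.mask_token
        for cand in fill_mask(" ".join(masked), top_k=top_k):   # stage 2: infill/edit
            edited = tokens.copy()
            edited[idx] = cand["token_str"]
            candidate = " ".join(edited)
            if predict_label(candidate) != original_label:
                return candidate                          # label flipped: valid counterfactual
    return None                                           # no flip found with single-token edits
```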
Data Influence. This family of approaches characterizes the influence of individual training samples by measuring how much they affect the loss on test points [Yeh et al.
2018]. The concept originally came from statistics, where it describes how model parameters are affected when a particular data point is removed. By observing patterns of influence, we can deepen our understanding of how models make predictions based on their training data. Since researchers have come to recognize the importance of data, several methods have been developed to analyze models from a data-centric perspective. First, influence functions enable us to approximate the concept by measuring loss changes via gradients and Hessian-vector products without retraining the model [Koh and Liang
2017]. Yeh et al. [
2018] decompose the prediction of a test point into a linear combination of training points, where positive values denote excitatory training points and negative values indicate inhibitory points. Data Shapley employs Monte Carlo and gradient-based methods to quantify the contribution of data points to the predictor’s performance, where a higher Shapley value indicates the kind of data that is more valuable for improving the predictor [Ghorbani and Zou
2019]. Another method uses
stochastic gradient descent (
SGD) and infers the influence of a training point by analyzing minibatches without that point using the Hessian vector of the model parameters [Hara et al.
2019]. Based on such an approach, TracIn derives the influence of training points using a first-order approximation of the loss change evaluated at checkpoints saved during the training process [Pruthi et al.
2020]. However, the aforementioned methods often come with an expensive computational cost even when applied to a medium-sized model. To address this, two key dimensions can be considered: (1) reducing the search space and (2) decreasing the number of approximated parameters in the Hessian vector. Guo et al. [
2020] also demonstrate the applicability of influence functions in model debugging. Recently, Anthropic has employed the
Eigenvalue-corrected Kronecker-Factored Approximate Curvature (
EK-FAC) approximation to scale this method to LLMs with 810 million, 6.4 billion, 22 billion, and 52 billion parameters. The results indicate that as model scale increases, influential sequences better capture the reasoning process for queries, whereas smaller models often provide semantically unrelated pieces of information [Grosse et al.
2023].
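A simplified TracIn-style computation for a toy PyTorch model is sketched below; the checkpoint list, fixed learning rate, and loss function are placeholders, and the full method handles further practical details such as per-checkpoint learning rates and checkpoint sampling.

```python
# TracIn-style influence sketch: approximate the influence of a training
# example on a test example as the sum, over saved checkpoints, of the
# learning rate times the dot product of their loss gradients.
import torch

def tracin_influence(model, checkpoints, loss_fn, train_example, test_example, lr=0.01):
    x_tr, y_tr = train_example
    x_te, y_te = test_example
    influence = 0.0
    for state_dict in checkpoints:                 # list of saved model state_dicts
        model.load_state_dict(state_dict)
        g_tr = torch.autograd.grad(loss_fn(model(x_tr), y_tr), model.parameters())
        g_te = torch.autograd.grad(loss_fn(model(x_te), y_te), model.parameters())
        # Dot product of the flattened gradients, accumulated across checkpoints.
        influence += lr * sum((a * b).sum() for a, b in zip(g_tr, g_te)).item()
    return influence
```

A positive score suggests the training example tends to reduce the test loss (a proponent), while a negative score suggests it increases the loss (an opponent).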