[Topics] Test-Time Compute for Topics: Embrace uncertainty to reduce hallucinations, with multi-sample and/or semantic entropy #1880
This is a long-form issue, in three parts:

1. Exploiting multi-sampling for Topics
2. Semantic entropy
3. Extending semantic entropy to multilabel
Exploiting multi-sampling for Topics
A common way to improve LLM output, in particular against hallucinations, is to use test-time compute: make the LLM work extra, with multiple calls, to get a better answer -- with or without fine-tuning. See for example the Introduction in (Snell et al. 2024) for an accessible overview and key references [the rest of the article is very thorough but more advanced than what we need for now]. This is also, at its simplest, what is done by algorithms that keep retrying until the LLM returns an answer that passes sanity checks (e.g. non-empty output).
Another place where sampling multiple times is used is Bayesian sampling: draw a sample from the distribution, rather than a single "best" output. This yields a measure of how certain the model is of its answer. We can use this with LLMs too. Indeed, sampling multiple answers and looking at the distribution of these outputs is a form of test-time compute where we use the whole output distribution rather than keeping only the best answer.
The algorithm looks as follows when applied to the topic categorization task, given a list of pre-specified topics (e.g. provided by the user, or discovered by a first step of the algorithm). For each comment:

1. Sample the categorization several times at non-zero temperature.
2. Form the empirical distribution of the sampled topics.
3. Compute the entropy of that distribution; if it is above a threshold, abstain or flag the comment for review.
4. Otherwise, return the mode of the distribution (or the top few modes, for multi-topic comments).
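A minimal sketch of this per-comment loop (the `sample_topic` callable and the 0.5-bit default threshold are illustrative assumptions, not fixed choices):

```python
import math
from collections import Counter
from typing import Callable

def categorize_with_uncertainty(
    comment: str,
    sample_topic: Callable[[str], str],  # one stochastic LLM call -> a topic label
    n_samples: int = 10,
    entropy_threshold: float = 0.5,      # in bits; needs tuning (see threshold discussion below)
):
    """Sample several categorizations and inspect their distribution."""
    counts = Counter(sample_topic(comment) for _ in range(n_samples))
    probs = [c / n_samples for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    if entropy > entropy_threshold:
        return None, entropy             # too spread out: abstain / flag for review
    mode, _ = counts.most_common(1)[0]
    return mode, entropy
```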
There are a few caveats in equating multi-topics with multi-modes: the entropy is naturally higher when several categories genuinely have high probability (a uniform split over k true topics alone contributes log₂ k bits, even when the model is confident about each), so the threshold needs to account for that. See below (extension of semantic entropy) for an alternative approach that could be more robust.
Notes on a few parameters and refinements:
Semantic entropy
A more advanced version of the above was recently published in Nature (Farquhar et al. 2024), and got some press in TIME Magazine (Perrigo 2024).
In a nutshell, its nice refinement is that, instead of using the entropy of the distribution of the raw answers, it first groups the answers by meaning, to account for various output formats ("The capital of France is Paris" vs "Paris is the capital of France" vs "Paris"). All the answers that are equivalent to one another (in the sense of "double entailment", i.e. A implies B and B implies A) are collapsed into the same equivalence class/category. Entailment is judged by asking the LLM "does A imply B?". Efficient grouping is done via a classical connected-component algorithm from graph theory.
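A sketch of this grouping step, assuming an `entails(a, b)` helper that queries the LLM (here via a small union-find, which computes the same connected components):

```python
from typing import Callable

def semantic_clusters(
    answers: list[str],
    entails: Callable[[str, str], bool],  # assumed LLM judge: "does A imply B?"
) -> list[list[str]]:
    """Group answers into equivalence classes under double entailment.

    Two answers are merged when A entails B and B entails A; the classes are
    the connected components of the resulting graph.
    """
    parent = list(range(len(answers)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(answers)):
        for j in range(i + 1, len(answers)):
            if find(i) != find(j):  # skip entailment calls for already-merged pairs
                if entails(answers[i], answers[j]) and entails(answers[j], answers[i]):
                    parent[find(i)] = find(j)

    groups: dict[int, list[str]] = {}
    for i, answer in enumerate(answers):
        groups.setdefault(find(i), []).append(answer)
    return list(groups.values())
```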
This amounts to semantic entropy, in the sense that it takes the entropy over semantically distinct answer classes rather than merely character-distinct answers. They then decide that the algorithm cannot answer if the entropy is too high, or otherwise pick the mode of the distribution as the right answer -- and they demonstrate this reduces hallucinations. This is illustrated nicely in Figure 1a of their article.
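Putting the two pieces together, a sketch of the decision rule (this is the simple frequency-based entropy estimate over sampled answers; the threshold default is again an illustrative assumption):

```python
import math

def semantic_entropy_decision(answers, entails, entropy_threshold=0.5):
    """Semantic entropy over meaning classes, then answer or abstain."""
    clusters = semantic_clusters(answers, entails)  # grouping sketch above
    probs = [len(c) / len(answers) for c in clusters]
    entropy = -sum(p * math.log2(p) for p in probs)
    if entropy > entropy_threshold:
        return None, entropy                        # refuse to answer
    best = max(clusters, key=len)
    return best[0], entropy                         # a representative of the modal class
```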
That is the gist we need. The main content of the paper discusses two extra ideas, but we can skip them for topics.
Choice of parameters (thresholds, temperatures) is, as always, a question. In their article, the authors compare various methods and thus report AUROC (a.k.a. classical AUC) and AURAC (a new metric that takes non-response into account), which are integrated over thresholds. We do not have that luxury in a practical application, so we need to tune the threshold ourselves, as discussed above. As for the temperature, they just pick one without explaining why.
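Lacking threshold-integrated metrics at deployment time, one plausible tuning recipe (a sketch under assumed inputs: a labeled dev set with per-comment entropies and correctness of the modal answer; all names are illustrative, and this is not the paper's procedure) is a grid search for accuracy at a minimum answer rate:

```python
def tune_entropy_threshold(dev_entropies, dev_correct, min_coverage=0.7):
    """Pick the entropy threshold maximizing accuracy among answered
    comments, subject to answering at least `min_coverage` of them."""
    n = len(dev_entropies)
    best_t, best_acc = None, -1.0
    for t in sorted(set(dev_entropies)):
        # Comments we would answer at this threshold, with their correctness.
        answered = [ok for h, ok in zip(dev_entropies, dev_correct) if h <= t]
        if len(answered) / n < min_coverage:
            continue  # abstains too often at this threshold
        acc = sum(answered) / len(answered)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```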
Extending semantic entropy to multilabel
Generalization of semantic entropy to multilabel (aka multi-topics): the semantic entropy article assumes that there is a single correct answer to each question. To handle multilabel, we can either:

- Option A: keep the single-answer machinery and read off several topics as the several modes of the distribution (equating multi-topics with multi-modes, as above); or
- Option B: treat a full set of labels as the unit of analysis, so that each sampled answer is a complete set of topics.

Option A is nice and simple, but has a few potential issues, notably the entropy-threshold caveat discussed above: a comment that genuinely belongs to several topics has high entropy even when the model is confident about each of them. Option B needs more careful thinking, but avoids these issues by explicitly treating multilabel and expecting, as unique answers, full sets of labels; a sketch follows.
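A minimal sketch of Option B, where canonicalized label sets stand in for semantic equivalence classes (a proper version would generalize the entailment-based grouping to sets of labels; `sample_topics` and the string canonicalization are assumptions):

```python
import math
from collections import Counter
from typing import Callable

def multilabel_semantic_entropy(
    comment: str,
    sample_topics: Callable[[str], set[str]],  # assumed LLM call -> a full set of labels
    n_samples: int = 10,
) -> tuple[frozenset, float]:
    """Option B: each sample is a complete label set, and two samples count
    as 'the same answer' only if their canonicalized sets are equal."""
    counts = Counter(
        frozenset(label.strip().lower() for label in sample_topics(comment))
        for _ in range(n_samples)
    )
    probs = [c / n_samples for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    mode_set, _ = counts.most_common(1)[0]
    return mode_set, entropy
```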
This latter part requires a bit more thought (as mentioned in #1877, a literature review on multilabel would be handy), but that's a starting point!
References

- Farquhar, S., Kossen, J., Kuhn, L., Gal, Y. (2024). Detecting hallucinations in large language models using semantic entropy. Nature 630, 625–630.
- Perrigo, B. (2024). Scientists Develop New Algorithm to Spot AI 'Hallucinations'. TIME.
- Snell, C., Lee, J., Xu, K., Kumar, A. (2024). Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters. arXiv:2408.03314.