1 Introduction

Since the introduction of transformer language model (LM) architectures [35], large language models (LLMs) have quickly been shown to exhibit somewhat general problem-solving capabilities [24]. After being aligned to human preferences, as described in [19], they are widely used for chat purposes [18], even in safety-critical application domains such as medical question answering: Med-PaLM [29] and Google's most evolved Med-PaLM version [30] have indeed shown strong helpfulness and a low probability of harm. However, a critical risk of factually wrong but confident answers, often called hallucinations [12], remains an issue even for the latest popular LLMs [18].

Therefore, it is crucial for a user to be able to evaluate a model's claims, reducing the critical reliance on trust in situations with possibly fatal failure consequences. In the medical question-answering (QA) setting, the system Almanac [38] uses retrieval-augmented generation (RAG): It starts by retrieving relevant articles from a specialized medical database, which are then used to enrich the context of the request in order to generate factually correct answers. Enriching the context reduces the need for the applied LM to rely on its implicitly learned knowledge, which in turn reduces hallucinations caused by factually wrong knowledge or by continuing generation without knowledge. Further, a more specific context simplifies the task of well-structured generation and may in turn reduce hallucinations caused by unfortunate sampling. Almanac therefore mitigates misinformation, but although we can observe reduced hallucination occurrence, we argue that with any vocabulary sampling and without a rather infeasible complete understanding of the LM, hallucination cannot be eliminated.

As a consequence, a user in doubt needs to carefully inspect the linked resources for confirmation. Verifying answers using resources directly provided by the system instead of using external search engines can also indicate whether the system properly understood the input question. However, deciding whether the provided text articles really support the generated answer is not a simple task and may exceed the user's available time or domain knowledge.

Since those users who do not have enough time or experience could benefit the most from such a system, we aim to mitigate these obstacles by reducing the complexity and ambiguity of supporting resources. Specifically, we adopt a knowledge base (KB) to provide graphical plots of extracts as trustworthy and easily checked evidence.

We propose to fine-tune a pre-trained transformer LM as request and answer engine, using training samples generated from gap-text formulations of KB relations, backed by a trainer LM for fluent statement conjunction. The target model learns to classify relevant concepts. Answer generation is guided by prompt tuning, where the prompts are assembled from the extracted concepts. Following the common practice of sharing weight matrices between word embeddings and LM output parameters, we share the concept embeddings for prompt composition with the concept-classification weights.

Compared to classical RAG, apart from the fact that our retrieved content can be depicted visually, our proposed framework has the following key benefits:

  1. AutoRAG is designed with the motivation of further reducing hallucinations: Apart from making beliefs explicit for communication, symbolic models such as our knowledge base make beliefs explicit for reasoning as well, which we aim to apply within the model to stay focused rather than losing attention within a long context. Further, the close connection between LM and KB reduces the flexibility of the generative model, which is intended since we do not want the model to lose focus on specific KB concepts. Finally, the retrieval task, typically solved externally, could even improve generation focus due to weight sharing with closely related tasks. To properly show the effectiveness of these motivations, the example implementation has to be carried out with explicit focus on quantitative comparison to RAG. See Sect. 7 for further information.

  2. The retrieved concepts are not independent of each other. To some extent this holds for classical RAG as well, since similar documents will appear together more frequently, resulting in re-occurring collections of sequences. For AutoRAG, this yields relatively short re-occurring symbol sequences which can be recognized more easily. This fact can be observed in our example implementation, see Table 4.

  3. In both RAG and AutoRAG there needs to be a decision on how much of the retrieved context to keep for generation: Keeping more content potentially provides the model with more relevant information to work with. In classical RAG, however, this quickly results in a long context, with naive databases easily over-representing common information. AutoRAG presents knowledge without redundancy in a compressed way, using few representatives only. In Sect. 5.3 this is shown on a practical example.

For our implementation in an example use case, our KB is derived from a medical cause and effects analysis (MCEA) model [17] created by physicians for the field of nosocomial pneumonia. A small example of an MCEA model is depicted in Fig. 1.

Fig. 1

A small example showing an MCEA model where the entities are given by the nodes and the relations by the edges in the graph. The illustration is reprinted from [17] and slightly modified. Further details on an MCEA model are also given in [17]

It consists of system elements (\(c_1\), \(c_2\)), functions with parameters (\(f_1\) with \(v_1\), \(f_2\) with \(v_2\)), failures (\(e_1\), \(e_2\)), as well as actions (\(a_1\), \(a_2\)) and can be represented by a directed acyclic graph whose edges determine the hierarchy of nodes of the same type as well as the relations between them. The MCEA model actually features Markov decision process (MDP) [2] semantics for reasoning about failure states and possible solutions; more details on the MCEA model used in our implementation can be found in [17]. Using such a semantically rich basis for our approach allows for generating expressive training data, possibly even imparting reasoning skills. Still, the KB of our AutoRAG approach is modeled as a rather general set of entities and relations. This is chosen to keep the focus on its basic idea, which is independent of the semantics of the KB's basis and could therefore be applied in other domains as well. The practical mapping of the basis to a KB for AutoRAG may obscure the deeper semantics, but as long as the mapping does not drop salient features of the basis, the LM can learn relevant semantics from the training data, provided that the data has been generated by applying the semantics of the basis. With this idea, we map the whole model to 912 entities, taken from system elements, functions, failures, etc., and 30 relations (4 unary, 14 binary, 12 ternary), with meanings such as "X is a system element" or "With X, an increase in Y decreases Z." The mapping is a matter of model interpretation. Instead of the unary relations, we could also add four more entities, e.g., "system element", and replace all unary relations by a more general binary relation "X is a Y". Using a unique unary relation for each concept, however, is a handy feature for the LM: Since our implementation only presents the entities without explicit relations, a single concept directly translates to a statement via its unary relation's verbalization, which is easily learned during training. Therefore, the LM can even handle unrelated entities in the same context, presumably with a lower risk of hallucinating a relation.

Fig. 2

MCEA question answering examples, presenting answer text with additional visualization

Proving that the model can indeed learn semantics from the basis for reasoning tasks, however, remains a motivation for future work, since our training data does not yet incorporate model semantics. Instead, our implementation describes entities and their relations associated with an input text. See Fig. 2 for an output of our example system.

The rest of this work is organized as follows: We start with a discussion of related work and then formulate our target task in more detail. Before we dive into our proposed framework, we describe the basic parts of transformer LMs we will need. Right after presenting the main ideas of AutoRAG, we outline our implementation including training and evaluation. Finally, we list some limitations of this work, discuss our method from a wider perspective, and conclude.

2 Related Work

This work is motivated by a practical example from the medical domain and our proposed solution is inspired by multiple recent trends in the field of natural language processing. In this section, we discuss approaches using some of the techniques we implemented. For a comprehensive comparison to direct alternative techniques, the reader is referred to cited work.

In this work, we train a relatively small and affordable LM to act in a chat-like setting, in an attempt to make useful custom LMs accessible, in contrast to strong models such as ChatGPT,Footnote 1 a very large auto-regressive LM evolved from GPT-3 [3] using reinforcement learning from human feedback [19]. Like OpenAI's more recent GPT-4 [18] or Google's BardFootnote 2 using their latest PaLMFootnote 3 model, among others, ChatGPT is accessible for basic prompt answering or even fine-tuning, but not for interpretation of arbitrary intermediate embeddings or procedures. Due to these models' sizes and the common practice of not publishing model weights, public use is limited both in customization of inference and in model customization through fine-tuning. To mitigate at least the issue of closed source, open-source models like LLaMA [33] and LLaMA 2 [34] have been pre-trained and released. Building on these open-source models, effective and accessible task-oriented models have become popular, following instructions the user describes in a prompt. Some approaches use simple supervised fine-tuning with specialized high-quality datasets [4, 32, 36], extending the limits of the model size that can be handled on a single consumer GPU up to the largest LLaMA versions [5, 11]. To better control task-specific output, some approaches propose to train further parameters, either as additional layers [10] or as task-specific context with prefix or prompt tuning [13, 15, 40], in contrast to plain prompt engineering [16, 25, 27, 42], which is also available for closed-source LLMs.

AutoRAG uses both supervised fine-tuning and prompt tuning to generate task-specific output, while referring to and using a pre-defined knowledge base. In general, equipping an LM with a knowledge base is a popular idea as well. Some approaches introduce special architecture changes to tightly steer generation using a knowledge base [6, 22, 41]. More similar to AutoRAG is the idea of KALM [26] for training with explicit entity awareness, since it uses standard transformer layers and splits word and entity generation/classification into two different heads. In contrast to ours, their approach focuses more on general knowledge-enhanced pre-training, instead of focusing on a specific knowledge base and stressing a tight connection of text and entities. They train at a much larger scale and show a benefit of grounding language generation in structured knowledge, compared to conventional LLMs in a general QA setting.

For our training data, we use knowledge graph verbalizations from the field of knowledge graph-to-text approaches [1]. Therefore, of the dimensions described in survey papers on combining LMs with existing knowledge bases [20, 21], AutoRAG uses all three: training from knowledge base data, using knowledge base entities for prompt construction, and augmenting generated output. Our prompt tuning based on concept embedding sequences is an approach both for controllable text generation [39] and for context compression [23]. Using retrieved knowledge in addition to generated text, AutoRAG is further connected to RAG [7, 14], especially for trustworthy output in critical application domains [38]. Compared to other dense retrieval techniques optimized during training [8, 28], generating training text from concepts naturally yields clean retrieval labels. In comparison to plain retrieval-augmented generation, rather than querying an external knowledge source, we directly encode the formally modeled knowledge into the LM.
For retrieval, we force the model to associate concepts explicitly by introducing an intermediate one-hot context encoding. Similar to [31] but in a compressed manner, this narrows the gap between training data and real-world samples to more effectively access learned facts for answer generation.

For our AutoRAG example implementation, we deploy a formal model (FM) from a medical domain, developed using failure mode and effects analysis [17]. The semantics of this model are defined using an MDP [2]. Assuming that an LM is able to infer a problem state for an FM such as an MDP, the FM's semantics could be used to automatically verify that the LM's output actually describes a valid problem solution. Furthermore, AutoRAG not only grounds the LM in the symbols of an FM; the verbalization also grounds the FM's symbols in language. Given that the LM can meaningfully explain the concepts and their role in the underlying formal model, this constitutes a solution to the symbol grounding problem [9].

To the best of our knowledge, this is the first work introducing concepts given by a KB to a pre-trained LM by explicitly extending the input/output vocabulary.

3 Problem Formulation

In this section, we briefly state the problem we address in this paper, describe its salient parts formally, and afterwards walk through an example taken from our implementation.

Given We start with a KB and basic verbalizations. Let \(L\subseteq \varSigma ^*\) be the set of strings of some natural language based on a token alphabet \(\varSigma\). Formally, the KB consists of an entity domain E and a set of relations \(R=\{r_1,r_2, \dots \}\), with \(r_i\subseteq E^{\text {deg}(r_i)}\), i.e., \(\text {deg}(\cdot )\) denotes the degree of a relation and \(r_i\) is a \(\text {deg}(r_i)\)-ary relation. Entities and relations are accompanied by an entity verbalization function \({{\,\textrm{v}\,}}:E \rightarrow L\) and a gap text based relation verbalization function \({{\,\textrm{v}\,}}_r :L^{\text {deg}(r)} \rightarrow L\) for each relation \(r\in R\).
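To make these definitions concrete, the following is a minimal Python sketch of such a KB; the class and field names are illustrative, not part of our implementation.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeBase:
    """Minimal KB container: entities, n-ary relations, and verbalizations.

    Illustrative sketch of the structures defined above; not the paper's actual code.
    """
    entities: set[str] = field(default_factory=set)
    # relation name -> set of entity tuples (tuple length = relation degree)
    relations: dict[str, set[tuple[str, ...]]] = field(default_factory=dict)
    # entity verbalization v: E -> L
    entity_text: dict[str, str] = field(default_factory=dict)
    # gap-text relation verbalization v_r: one template per relation
    relation_template: dict[str, str] = field(default_factory=dict)

    def degree(self, relation: str) -> int:
        # deg(r): arity of the tuples stored for this relation
        return len(next(iter(self.relations[relation])))

    def verbalize(self, relation: str, args: tuple[str, ...]) -> str:
        # Fill the gap text with the entity verbalizations of the arguments.
        return self.relation_template[relation].format(*(self.entity_text[e] for e in args))
```

For instance, a binary gap text such as "{} is a failure of {}." combined with two entity verbalizations yields one fact sentence, as in the example below.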

Target Given the plainly verbalized KB, we build an interface for accessing knowledge formulated within the KB in a chat-like setting with submodel augmented system replies:

$$\begin{aligned} {{\,\textrm{t}\,}}:L \rightarrow L \times {\mathbb {P}}(\varPi ). \end{aligned}$$

Here, \({\mathbb {P}}(\varPi )\) is the powerset of \(\varPi\), the set of all parts of the KB that the system should be able to explicitly reference. Besides entities, these might, e.g., be relations or subsets of relations. For simplicity, we call \(\varPi\) the concepts of our KB, and for our implementation we use \(\varPi =E\), implying the full submodel induced by an entity subset. The goal for \({{\,\textrm{t}\,}}(l_\text {in}) = (l_\text {out},C_\text {out})\) is that \(C_\text {out}\) is the relevant subset of \(\varPi\) and \(l_\text {out}\) covers the relevant relations from R among the entities in \(C_\text {out}\). Relevance can be defined based on any specific target use case, but to keep our demonstration generic, relevance shall here be defined as describing the entities in the given context. In general, \({{\,\textrm{t}\,}}(\cdot )\) does not need to be deterministic, but if it is, the goal for our demonstration may be described as a non-trivial solution of:

$$\begin{aligned} {{\,\textrm{t}\,}}(l_\text {in}) = {{\,\textrm{t}\,}}(l_\text {out}). \end{aligned}$$

Example Let the target language L be the English language and the alphabet \(\varSigma\) be the union of the lower case and upper case letters, as well as special symbols like punctuation. Let further the concept space be \(\varPi =E\) and \(E=\{c_0, c_1, c_2, c_3, c_4, \dots \}\) with verbalizations

$$\begin{aligned} {{\,\textrm{v}\,}}(c_0)&= ``\text {balance}'',\\ {{\,\textrm{v}\,}}(c_1)&= ``\text {Arterial BGA}'',\\ {{\,\textrm{v}\,}}(c_2)&= ``\text {Insufficient oxygen supply: insufficient DO2}'',\\ {{\,\textrm{v}\,}}(c_3)&= ``\text {Demand-adapted oxygen supply for all cells}'',\\ {{\,\textrm{v}\,}}(c_4)&= ``\text {Ventilation}'' \end{aligned}$$

and the relations \(R=\{r_0, r_1, r_2, r_3, r_4, \dots \}\) where

$$\begin{aligned} r_0&=\{(c_0), (c_1), \dots \},\\ r_1&=\{(c_2), \dots \},\\ r_2&=\{(c_3), \dots \},\\ r_3&=\{(c_0, c_2), (c_1, c_2), \dots \},\\ r_4&=\{(c_2, c_3), \dots \} \end{aligned}$$

with verbalizations

$$\begin{aligned} {{\,\textrm{v}\,}}_{r_0}(\text {X})&=\text {``X is an action to consider for certain failures.''},\\ {{\,\textrm{v}\,}}_{r_1}(\text {X})&=\text {``X is a failure of a system element's function.''},\\ {{\,\textrm{v}\,}}_{r_2}(\text {X})&=\text {``X is a system element's function.''},\\ {{\,\textrm{v}\,}}_{r_3}(\text {X}, \text {Y})&=\text {``X is an action to consider for Y.''},\\ {{\,\textrm{v}\,}}_{r_4}(\text {X}, \text {Y})&=\text {``X is a failure of Y.''} \end{aligned}$$
Fig. 3

Exemplary input and output. The blue box contains the input text \(l_\text {in}\) and the green box shows generated output text \(l_\text {out}\), right above a depiction of \(C_\text {out}\)

See Fig. 3 for exemplary system input and output.

4 Transformer Language Model

Since we use LMs for our framework AutoRAG and for our example implementation, called MCEA-LM in the following sections, we briefly introduce the required techniques here. In general, an LM may be any function predicting words or word parts to form language, typically given some starting context. But since the parallelizability of transformer LMs allows for effective model scaling up to ground-breaking LLMs, we focus on the transformer [35] language model here. Specifically, we only use the encoder as in [24]. Note that for simplicity we only sketch a rough overview of the basic implementation we use.

Tokenization: To feed language into the LM, the input is tokenized into words, word-parts and symbols of a pre-defined vocabulary. With the vocabulary \(\varSigma\), this translates a language string into a sequence of word identifiers (IDs) in \([0,1,\dots ,|\varSigma |-1]\) where \(|\cdot |\) denotes set cardinality.

Input Embedding: The whole sequence of n word IDs may be represented by an \(n\times |\varSigma |\) matrix \(T_\text {in}\) of stacked one-hot token vectors. This way, they can be translated to a sequence of vocabulary embeddings using a \(|\varSigma | \times d\) word embedding matrix W by multiplication. d is the embedding dimension size. Since the transformer model does not distinguish between the rows of this embedding, i.e., the position of each token in the sequence, we need to encode the positions separately. Analogously to vocabulary embeddings we create position embeddings from position IDs \(P_\text {in}\) and a position embedding matrix P. Position embeddings are simply added to the vocabulary embeddings:

$$\begin{aligned} E_\text {in} = T_\text {in}W + P_\text {in}P \end{aligned}$$
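As an illustration, this embedding step can be sketched as follows; a minimal PyTorch sketch in which the one-hot products \(T_\text {in}W\) and \(P_\text {in}P\) are realized as the equivalent row lookups.

```python
import torch

def input_embeddings(token_ids: torch.Tensor,
                     W: torch.Tensor,    # |Sigma| x d vocabulary embedding matrix
                     P: torch.Tensor) -> torch.Tensor:  # max_len x d position embedding matrix
    """Compute E_in = T_in W + P_in P for one sequence of n token IDs."""
    n = token_ids.shape[0]
    token_emb = W[token_ids]         # n x d, equals T_in W
    pos_emb = P[torch.arange(n)]     # n x d, equals P_in P
    return token_emb + pos_emb
```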

Transformer Encoder Layers: Up to this point, tokens are embedded individually. In a transformer layer, token embeddings are first mixed using multi-head attention \({{\,\textrm{mha}\,}}(\cdot )\) and passed through a token-wise feed-forward module \({{\,\textrm{ff}\,}}(\cdot )\), without changing embedding dimension d. Again, see [35] for details. The transformer features multiple of these layers, applied to the input sequentially. Let \(E_0 = E_\text {in}\), then the output of the l-th layer may be described by

$$\begin{aligned} E_l = {{\,\textrm{ff}\,}}({{\,\textrm{mha}\,}}(E_{l-1}) + E_{l-1}) + {{\,\textrm{mha}\,}}(E_{l-1}) + E_{l-1} \end{aligned}$$

After applying all layers of the encoder stack, the embeddings are no longer independent of their context due to multi-head attention's mixing; thus we call \(E_l\) contextualized token embeddings. We use a causal language model, auto-regressively predicting tokens using unidirectional attention: In practice, we restrict the token mixing such that each token can only attend to tokens on its left in the sequence.

Language Modeling: To translate contextualized token embeddings back to language, they are decoded back to word IDs. This can be done in different ways using token scores S which we obtain by multiplying the contextualized token embeddings with the token embedding matrix again: \(S = E_l W^T\). The actual decoding can be done by sampling from a score based probability distribution over the vocabulary, but for simplicity we greedily decode by taking the index of the largest token score, i.e., \(T_\text {out}\) is the column number of the largest value for each row in S. We train the language modeling using shifted labels, i.e., with special tokens “\(\texttt {<sep>}\)” and “\(\texttt {<eos>}\)” we mark the start of input tokens and the end of labels, respectively.

Sequence Classification: For sequence classification, e.g., to predict whether a concept is relevant for an input text, we append a special token "\(\texttt {<pool>}\)" to the input text. After contextualization, we use a pooling layer which drops all token embeddings except for the last one, which originally was the added special token embedding. This contextualized embedding \(e_l\) can also be used as a sequence embedding. For classification with a set of classes \(\varPi\), we multiply this row vector with a \(|\varPi | \times d\) classification matrix C to get class scores \(s_c = e_l C^T\). In our implementation we use this score vector \(s_c\) for multi-label classification and finally classify each concept either by picking the top-k scores or individually by comparing its score against a threshold. Further, the classification matrix C is defined here analogously to the word embedding matrix W, since C is used in AutoRAG as concept input embedding as well; a minimal sketch of this pooling-based scoring is given below. Building on the transformer architecture described in this section, we can now introduce our framework in detail.
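A concrete illustration of the pooling-based concept scoring, assuming PyTorch tensors; the helper name is ours.

```python
import torch

def concept_scores(E_l: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
    """Score all concepts from the contextualized <pool> token.

    E_l: n x d contextualized token embeddings with <pool> appended last.
    C:   |Pi| x d classification matrix, shared with the concept input
         embeddings analogously to tying W with the language modeling head.
    Returns the |Pi|-dimensional score vector s_c = e_l C^T.
    """
    e_l = E_l[-1]        # pooling: keep only the last (<pool>) embedding
    return e_l @ C.T
```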

5 AutoRAG

The problem described above is solvable in two steps:

\({{\,\textrm{r}\,}}:L \rightarrow {\mathbb {P}}(\varPi )\), retrieve the relevant concepts and relations from a given KB,

\({{\,\textrm{d}\,}}:{\mathbb {P}}(\varPi ) \rightarrow L\), verbalize the retrieved content.

Composing these functions yields the solution

$$\begin{aligned} {{\,\textrm{t}\,}}(l_\text {in}) = ({{\,\textrm{d}\,}}({{\,\textrm{r}\,}}(l_\text {in})), {{\,\textrm{r}\,}}(l_\text {in})). \end{aligned}$$

Without special emphasis on eloquence, \({{\,\textrm{d}\,}}(\cdot )\) can be implemented straightforwardly by translating each concept and each relation to text using the provided verbalization functions \({{\,\textrm{v}\,}}(\cdot )\) and \({{\,\textrm{v}\,}}_r(\cdot )\). Provided these functions produce meaningful text, this translation approach works properly for simple facts like isolated concepts, or concepts in a single relation resulting in just a single text. If direct verbalization results in multiple text snippets, we can present them in arbitrary order. To slightly improve eloquence and arrange the facts in a more meaningful order, we ask an LLM trained for instruction following, with an appropriate prompt, to naturally arrange the given facts.
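A minimal sketch of this straightforward verbalization step, assuming a KB object like the one sketched in Sect. 3 with `relations` and `verbalize`; the function name is illustrative.

```python
def verbalize_submodel(concepts: set[str], kb) -> list[str]:
    """Straightforward d(.): turn the retrieved concept subset into a list of
    fact sentences using the verbalization functions v and v_r.

    Isolated concepts are covered via their unary relations; the resulting
    fact list can be presented as-is or handed to an instruction-following
    LLM to be arranged as fluent text without adding facts.
    """
    facts = []
    for relation, tuples in kb.relations.items():
        for args in tuples:
            # keep only relation instances fully contained in the retrieved subset
            if all(entity in concepts for entity in args):
                facts.append(kb.verbalize(relation, args))
    return facts
```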

For r, we need a text classifier for binary concept-wise relevance discrimination. We add a special token which, after contextualization, is passed through a classification layer.

Together, these approaches make up a working solution, but they require two LMs: one with special fine-tuning for classification and another one for rearranging text. Given that we train an LM for our text classifier \({{\,\textrm{r}\,}}(\cdot )\) anyway, we can train for generation in parallel, using samples from the straightforward approach initially described for \({{\,\textrm{d}\,}}(\cdot )\) as training samples. Since the content output domain for generation is limited to covering the explicitly given knowledge of our KB, we can reduce the required model size in comparison to models used for general instruction following. Using a general LLM only as trainer, this allows training an LM for accessing the formalized knowledge in a chat-like environment on a standard consumer computer with reasonable latency. The output consists of both the generated text \(l_\text {out}\) and the determined concept subset \(C_\text {out}\), which can be depicted as labelled nodes in a graph extracted from our trusted KB.

5.1 Language Model Architecture

Our starting point is a pre-trained LM such as GPT [24], and we add a special vocabulary token for each concept contained in the given KB that the LM should explicitly be able to work with. These might be anything, entities or relations, but in our implementation we use entities only, keeping the size of our extra vocabulary small.Footnote 4 Additionally, we add a pooling layer consisting of a single linear layer, analogous to the language modeling layer: Instead of sharing the vocabulary embedding parameters, the concept pooling layer shares the concept embedding weights. An overview of the architecture can be found in Fig. 4; a minimal sketch of the vocabulary extension follows below.
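One possible realization of this vocabulary extension with HuggingFace's Transformers, sketched with illustrative token names; the training code and the full pooling layer are omitted.

```python
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

concept_tokens = [f"<c{i}>" for i in range(912)]   # one token per KB entity
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<sep>", "<eos>", "<pool>"]})
tokenizer.add_tokens(concept_tokens)
model.resize_token_embeddings(len(tokenizer))      # adds rows for the new tokens

# The concept classification matrix C is shared with the concept input
# embeddings: it is the slice of the (resized) embedding matrix belonging to
# the concept tokens, analogous to tying W with the language modeling head.
concept_ids = tokenizer.convert_tokens_to_ids(concept_tokens)

def concept_classification(pool_embedding: torch.Tensor) -> torch.Tensor:
    C = model.get_input_embeddings().weight[concept_ids]   # |Pi| x d
    return pool_embedding @ C.T                            # concept scores s_c
```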

Fig. 4

Architecture overview. Concepts and words, as well as special tokens, are embedded in the same input dimension. The combined sequence embedding is then equipped with position information and the result is passed through the transformer layer stack. Starting with the \(\texttt {<sep>}\) token followed by contextualized token embeddings, the language modeling layer predicts the respective next word IDs. The concept classification layer predicts related concepts from contextualized \(\texttt {<pool>}\) tokens only. Note that for concept classification the concept IDs and \(\texttt {<sep>}\) are omitted from the input and the first token starts at position 0

5.2 Training

Our training data must contain both text for query/answer (\(l_\text {in}\)/\(l_\text {out}\)) and IDs of the relevant KB extracts. The training target is twofold:

  (1) concept association, and

  (2) language generation.

Given a query, one part of the training goal is to retrieve the parts of the KB that are relevant for answering the request. To allow the query to contain relevant task information which cannot be modeled using the KB, the input context for language generation training may consist of the concatenation of the query and the associated concepts. If the KB can completely capture the relevant task information, we can omit the query and train generation from associated concepts only. This leads to the text and concept auto-encoding task we implemented as demonstration, where the target text is used as query as well: \(l_\text {in} = l_\text {out}\). It demonstrates how a reasonably sized KB with basic verbalization functions can be used to generate a large training set by sampling subgraphs of the KB's concept nodes and relation-induced edges.

5.3 Comparison to Classical RAG

In contrast to AutoRAG, a classical RAG pipeline can be separated into two parts, where the first part takes in the query and outputs a retrieval-augmented query containing the relevant information for the generation of the final informed answer. In AutoRAG, the retrieval is a reformulation of the query using the model's own internal vocabulary. Classical RAG could, for example, first retrieve relevant facts from the set of all explicit facts in the KB and produce the following prompt for the LM to generate its answer from: “How to check oxygen supply? \(\texttt {<cont>}\) Balance is an action to consider for certain failures. Arterial BGA is an action to consider for certain failures. [...] Insufficient oxygen supply: insufficient DO2 is a failure of Demand-adapted oxygen supply for all cells. \(\texttt {<answer>}\)”. Recall Sect. 3 for a proper impression of how long such a prompt can become by naively collecting relevant facts, even in toy-sized examples. AutoRAG, in comparison, simply prompts the question \(l_\text {in}\), i.e., “How to check oxygen supply?\(\texttt {<pool>}\)”, upon which it analyzes its output concerning the \(\texttt {<pool>}\) token and, without translating concept tokens to sequences of basic vocabulary tokens, re-prompts itself with \((l_\text {in},C_\text {out})\), i.e., “How to check oxygen supply?\(\texttt {<pool>}\) \(\texttt {<c0>}\) \(\texttt {<c1>}\) \(\texttt {<c2>}\) \(\texttt {<c3>}\) \(\texttt {<sep>}\)” to generate its final answer \(l_\text {out}\) and return \((l_\text {out},C_\text {out})\); a minimal sketch of this two-pass prompting is given below.
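A minimal sketch of the two-pass inference; the `lm` helper methods are placeholders for the classification and generation passes described above, not an existing API.

```python
def autorag_answer(question: str, lm, top_k: int = 4):
    """Two-pass AutoRAG inference (illustrative helper names).

    Pass 1: "<question><pool>"                   -> concept scores, keep top-k as C_out.
    Pass 2: "<question><pool><c_i>...<c_j><sep>" -> generate l_out.
    """
    scores = lm.concept_scores(question + "<pool>")   # hypothetical helper, see Sect. 4
    c_out = scores.topk(top_k).indices.tolist()
    concept_prefix = "".join(f"<c{i}>" for i in c_out)
    l_out = lm.generate_text(question + "<pool>" + concept_prefix + "<sep>")  # hypothetical helper
    return l_out, c_out
```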

6 Example Implementation

In our implementation, we fine-tune gpt2-medium [24] using HuggingFace’s Transformers [37]. In this section, we outline our training data generation and training first, followed by our empirical evaluation, a comparison to a RAG alternative, and the implementation of an example question answering service.

6.1 Training Data

Using the KB and basic verbalization functions, we could theoretically generate \(2^{|\varPi |}\) unique samples, each comprising a unique concept subset with its respective verbalization. However, many of these samples would be practically ineligible: Large concept subsets can easily become overly complex, and sets without relations among the covered concepts might lack desirable meaning. Therefore, we design a sampling mechanism which starts with one or two independently sampled concepts and then samples from the relations they appear in to iteratively find randomly connected concepts (a sketch of this sampling is given below). Again, the exact implementation here is subject to the specific use case and is of minor importance for the general framework. Our implementation is guided by the idea that concepts should mostly be accompanied by connected concepts to promote prediction of cohesive components rather than randomly scattered concepts. On the other hand, samples should at least cover some independent components so that non-connectedness is represented as well. After a concept subset is found, we list all relations among these concepts in natural language using the basic verbalization functions and use open_llama_7b_v2_med_instructFootnote 5 to translate the fact list into fluent text, instructing it not to add additional facts. Evaluation and test sets consist of randomly generated samples. Our first training set comprises 10K samples of which only one half is generated randomly, while the other half contains all basic facts. Each training set is iterated ten times. The following training sets of 10K samples only contain a fixed number of basic facts and are filled up with random samples.
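A minimal sketch of the subset sampling, assuming the KB structure from Sect. 3; the size limit and the early-stopping probability are illustrative choices, not our exact parameters.

```python
import random

def sample_concept_subset(kb, max_size: int = 6) -> set[str]:
    """Sample a mostly connected concept subset for one training sample.

    Starts from one or two independently drawn concepts and repeatedly draws a
    relation instance touching the current subset to add connected concepts.
    `kb.relations` maps relation names to sets of entity tuples.
    """
    subset = set(random.sample(sorted(kb.entities), k=random.choice([1, 2])))
    while len(subset) < max_size:
        touching = [args for tuples in kb.relations.values() for args in tuples
                    if any(e in subset for e in args)]
        if not touching:
            break
        subset.update(random.choice(touching))
        if random.random() < 0.3:     # stop early so subsets stay small
            break
    return subset
```

The facts among the sampled subset are then verbalized and handed to the trainer LLM for fluent rewriting.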

6.2 Training

We train using two objectives:

Generation:

We train the LM for auto-regressive, conditional language generation using left-shifted tokens as labels, while presenting a prefix of relevant concept embeddings. We use CrossEntropyLossFootnote 6 on output token predictions.

Concept classification:

We present the input and append a classification token which we use in the pooling layer to associate concepts, optimized using binary cross entropy loss (BCEWithLogitsLossFootnote 7).

To minimize padding, we build batches by collecting samples of similar length. Each batch may contain up to 2310 tokens, resulting in a variable number of samples per batch. Batches are used for training in shuffled order. Each dataset of 10K samples amounts to roughly 261 batches. We employ the AdamWFootnote 8 optimizer (\(\beta _1=0.9\), \(\beta _2=0.99\), \(\texttt {weight\_decay}=0.1\)), a linear learning rate schedule with a peak learning rate of \(5\times 10^{-4}\) and 10% warm-up steps, and train for 52,200 steps without gradient accumulation. Let \({\mathcal {V}}\) be the vocabulary loss and \({\mathcal {C}}\) the concept loss, where a loss \({\mathcal {L}}_t\) is computed on training data and \({\mathcal {L}}_v\) on a set of 100 random validation samples. At each epoch, we compute performance and save a checkpoint if

$$\begin{aligned} {\mathcal {B}} = 2({\mathcal {V}}_t + {\mathcal {C}}_t) + ({\mathcal {V}}_v + {\mathcal {C}}_v) \end{aligned}$$

based on the last training step lies below the previous minimum.
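For concreteness, the optimization setup and the checkpoint criterion could be realized as follows; a sketch using PyTorch and HuggingFace utilities, where the pre-trained model loaded here merely stands in for the extended LM of Sect. 5.1.

```python
import torch
from transformers import GPT2LMHeadModel, get_linear_schedule_with_warmup

model = GPT2LMHeadModel.from_pretrained("gpt2-medium")  # stand-in for the extended LM

total_steps = 52_200
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4,
                              betas=(0.9, 0.99), weight_decay=0.1)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps)

lm_loss_fn = torch.nn.CrossEntropyLoss()        # generation objective (V)
concept_loss_fn = torch.nn.BCEWithLogitsLoss()  # concept classification objective (C)

def checkpoint_criterion(V_t, C_t, V_v, C_v) -> float:
    """B = 2(V_t + C_t) + (V_v + C_v): weighted sum of training and
    validation vocabulary and concept losses used for checkpoint selection."""
    return 2 * (V_t + C_t) + (V_v + C_v)
```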

Fig. 5

Training statistics of our MCEA-LM, measured after each epoch. \({\mathcal {B}}\) is our checkpoint selection criterion, the weighted sum of training and evaluation loss. The plot only contains data up to the point where our best checkpoint was saved

Fig. 6

Training statistics of our MCEA-LM. Both \({\mathcal {V}}\) and \({\mathcal {C}}\) are mean-aggregated over the batch dimension, while \({\mathcal {V}}\) is mean-aggregated over the prediction dimension as well, but \({\mathcal {C}}\) is summed over the prediction dimensions. The plots cover all training steps. For a smoother plot, batches of 52 steps are aggregated. The darker line shows the average while lighter outlines display minimum and maximum, respectively

Figure 5 provides an overview of training and evaluation statistics and Fig. 6 shows the training loss in greater detail, plotting vocabulary and classification loss separately. The plots clearly show increased loss at training set changes, affecting both training and evaluation performance, probably caused by slightly less effective gradient updates due to relatively high confusion. The best performance gradually improves with each dataset change until the best score of \({\mathcal {B}} = 0.27\) is reached after 45,936 training steps.

6.3 Evaluation

For evaluation, we generate 10K new samples. In addition to the text generated by the trainer LLM using the same prompt template as for training, we equip each sample with a text generated with a focus on brevity (see Table 1) and with the input provided to the trainer LLM, i.e., the concatenation of all represented basic facts. We measure the model's classification and generation performance in the following settings:

Train-like:

Classify from the text generated using the same prompt template as for training, measuring training classification performance.

Shifted style:

Classification based on the prompt template with a focus on short explanations. Although these texts may appear a little more fluent than the training text, they are still quite similar in style. It is conceivable to add additional style hints in the instruction text; however, respective prompts appear to be prone to omitting relevant facts from the input.

Basic facts:

Text obtained by concatenating all represented basic fact verbalizations.

Generated text:

Present the sample’s concepts to the LM, generate a textual description, and classify concepts from the result.

Shuffled generated:

The same as generated text from above, except for the sample’s concepts being presented in a random order.

Table 1 Prompt template with instructions used for (1) generating training, and (2) additional evaluation text samples using the trainer LLM
Table 2 Randomly picked evaluation sample and generated texts

Table 2 is an example, randomly taken out of the evaluation process. It shows all used text types for comparison. During sample generation, after determining the concept subset, sample facts are verbalized following some deterministic rules. Thus, the order given in the sample is not random and contains hints on combination and order of represented relations. However, due to binary concept-wise classification, the classification only produces an unordered set of concepts. The order of concepts could be used purposefully here, e.g., to guide generation focus by sorting concepts by score. As we do not implement such features due to our generic approach without further specifying a use case, we assume typical retrieved concepts to be in random order.

Fig. 7

MCEA-LM concept classification receiver operating characteristic (ROC) curves. k is top-k and \(\theta\) is threshold-based classification. T uses test texts, P uses the plain list of basic facts, G uses generated text and G\(_s\) uses generated text after shuffling sample’s concepts. Classification from train-like text is plotted in red

Fig. 8

MCEA-LM concept classification F1-Score plots. T uses test texts, P uses the plain list of basic facts, G uses generated text and G\(_s\) uses generated text after shuffling sample’s concepts. Classification from train-like text is plotted in red

As can be seen in Figs. 7 and 8, plain and shuffled generation perform almost identically in our setting. Figure 7 shows false-positive rates and true-positive rates for each setup with different thresholds and numbers k of concepts to select. While this plot shows the general effectiveness of the classifiers in detail, it cannot show which specific thresholds and values of k perform best. Clearly, for any setup, reducing the threshold—or increasing k, respectively—leads to both higher true-positive and false-positive rates. For the direct relation of the threshold or the number of selected concepts to classification performance, Fig. 8 gives a better overview. The F1 score is similar to the Jaccard index for measuring the similarity of target and retrieved concept sets. Although F1 puts greater focus on true-positives than on false-positives and false-negatives, both scores are 1 for perfectly matching sets and 0 if there is no overlap. Both plots, Figs. 7 and 8, indicate that the closer the text is to simply listing facts, the better the classification. For basic fact-listing text, the classification is almost perfect with \(\text {AUC}=1.00\). While still effective with \(\text {AUC}\ge 0.96\), generated samples perform worst in our comparison. Figure 9, a histogram over classification performance for individual concepts, gives an impression of how differently the classification works for different concepts. It shows that most of the concepts are classified with an F1 score of at least 0.8, but there are some particularly hard concepts as well. An analysis like this could be used as a bias for sampling starting concepts in the generation of new training samples. This would boost hard samples during training and would avoid over-representation of high-degree concepts, i.e., concepts with a high number of connections which therefore appear often in our simplistic subset sampling.

Fig. 9

MCEA-LM concept classification of individual concepts using threshold-based classification with a threshold of 0.5. The colored bars show scores for individual description texts, while the black bars use scores averaged over all texts for each individual concept

Table 3 The top-10 concept scores for a short test text, “Transport breathing gas is a function of left main bronchus, which is a sub-function of segmental bronchi”

Comparing top-k and threshold classification, thresholds perform better; in practice, however, top-k becomes a strong competitor: While our generated test texts are still fairly similar to the training texts, the styles of user-written inputs may diverge arbitrarily. See Table 3 for an example of how a different input style drastically reduces concept scores. While top-k still performs well, a threshold working well on our generated test samples can easily return no concept at all. Therefore, careful threshold adaptation would be required in practice, using thresholds in a range where top-k performs similarly, so in the following we use top-k for simplicity; a minimal sketch of the score-based selection and the set comparison is given below.
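A minimal sketch of the concept selection and the set-based F1 score used in this evaluation; function names are illustrative.

```python
import torch

def select_concepts(scores: torch.Tensor,
                    k: int | None = None,
                    threshold: float | None = None) -> set[int]:
    """Top-k or threshold-based selection from the concept score vector s_c."""
    if k is not None:
        return set(torch.topk(scores, k).indices.tolist())
    return set((scores > threshold).nonzero(as_tuple=True)[0].tolist())

def f1_score(target: set[int], predicted: set[int]) -> float:
    """Set-based F1 between target and retrieved concept sets
    (1.0 for a perfect match, 0.0 for disjoint sets)."""
    if not target and not predicted:
        return 1.0
    tp = len(target & predicted)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(target)
    return 2 * precision * recall / (precision + recall)
```

Sweeping k and the threshold over such selections yields curves like those in Figs. 7 and 8.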

6.4 RAG Comparison

For an impression of AutoRAG's practical benefit, we implement a more classical RAG solution for comparison. Since there are many ways this could be implemented, ranging from completely separate modules for retrieval and answer generation to end-to-end trained solutions, we decided to stay close to the AutoRAG implementation to highlight the main difference in retrieval and generation: AutoRAG uses the special embeddings for associating facts of the KB also for representing these facts. Therefore, we equip the RAG version with an additional lookup table of KB facts to compose the complete context using the basic input vocabulary only. Naively, instead of associating concepts only and collecting all relevant facts for prompts, one could as well associate each fact individually (resulting in a total of 3823 facts instead of 912 concepts to associate in our case). But since direct fact association did not achieve competitive performance in our case, we omit further training details here. It did achieve an AUC of at least 0.95 for all input text versions, but the F1 score stayed below 0.4 for all values of \(\theta\) and k. Instead, we implemented the retrieval by associating KB concepts and returning all facts concerning relations among the associated concepts. We trained the RAG version on all random samples generated for our AutoRAG training, using the same maximum token number per batch. Since the retrieved context is not compressed in the RAG case, training samples were significantly longer and had to be cropped more often due to gpt2-medium's maximum sequence length of 1024. As a further consequence, each training set of 10K samples no longer fit into an average of 261 batches but filled about 840 batches. Therefore, training the RAG model on the same data for the same number of iterations took more than twice as many training steps. The training behavior shows the same characteristics as for AutoRAG, with jumps in training loss on dataset changes. Overall, the loss levelled off slightly worse than for AutoRAG, with a concept verbalization loss around 0.25 and an association loss of about 0.5. The best score of \({\mathcal {B}}_\text {RAG} = 2.99\) was reached after 123,479 steps. Since the evaluation involves sampling answers, which takes significantly longer for transformers with the standard attention mechanism as the context grows, we only evaluate and compare the RAG solution using 1K of the 10K evaluation samples. See Fig. 10 for a classification performance comparison.

Fig. 10

Classification performance comparison of our AutoRAG implementation and our RAG alternative

It shows that both models perform almost identically in top-k classification of train-like texts, and AutoRAG performs slightly better in threshold classification. In top-k classification from model-generated text, RAG appears to be slightly superior. The fact that this trend is not mirrored in threshold classification might be due to style differences between train-like and generated texts, potentially causing noisy scores for negatives and high variance for positives. A closer inspection of individual examples indeed shows differences in generation styles.

Table 4 Comparison of train-like texts and the respective model-generated versions

Table 4 shows an example where the sampled facts redundantly cover both directed versions of the sub-element relation. For this sample, the trainer language model, instructed to keep all mentioned facts from the plain fact text, does treat them as different facts, but it puts them in the same direction and falsely indicates that there are two different sub-elements. Since the ordered combination of these two system elements is also a basic training fact, AutoRAG settled for the text seen during training. In contrast, RAG does not appear to produce any meaningful conjunctions, nor does it recognize the input as a frequently seen training sample; it simply copies and repeats facts from its retrieved context. The output style RAG produces is therefore close to the plain fact text, and hence a basic RAG approach is not a meaningful alternative to AutoRAG in our scenario, since we could simply create the plain text using the verbalization functions without any answer-generating language model.

6.5 Query Answering Service

Since our training does not include any specific target task, the classification output is just biased to loosely resemble connected subsets. We assume a user to ask for information about concepts which in general are somehow connected. Therefore, after associating concepts from our KB, we check whether they are connected. If they are not directly connected, we search for short connections using depth-limited breadth-first search and add the respective concepts if a path is found (see the sketch below). Afterwards, we generate verbalizations for independent components separately, omitting separated concepts. If no connections are found, we present verbalizations of isolated concepts and ask the user to reformulate their request for better results. Additionally, we present the associated concepts, visualized using Graphviz.Footnote 9 Recall Fig. 2 for examples.
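A minimal sketch of the depth-limited connection search; the adjacency `neighbors` is assumed to be derived from the KB's relations, and the parameter names are illustrative.

```python
from collections import deque

def connect(concepts: set[str], neighbors: dict[str, set[str]],
            max_depth: int = 2) -> set[str]:
    """Add concepts lying on short paths between the associated concepts.

    For each associated concept, a breadth-first search limited to max_depth
    steps looks for another associated concept; intermediate nodes on any
    found path are added to the result.
    """
    augmented = set(concepts)
    for source in concepts:
        frontier = deque([(source, [])])
        visited = {source}
        while frontier:
            node, path = frontier.popleft()
            if len(path) > max_depth:
                continue
            if node in concepts and node != source:
                augmented.update(path)   # keep intermediate concepts on the path
                continue
            for nxt in neighbors.get(node, ()):
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, path + [nxt]))
    return augmented
```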

7 Discussion

AutoRAG is a framework combining an LM with a knowledge base, by introducing the knowledge base’s concepts as additional prefix vocabulary, further used in a concept classification head. This allows us to build a knowledge-bound LM, associating known concepts with input text and explaining formally modeled, related knowledge.

From the comparison of basic RAG and AutoRAG in Sect. 5.3, the relation to RECITE [31] becomes apparent, since its authors propose to let the model write its own comments on queries as retrieval results. The difference to AutoRAG is, however, that we keep the trained facts explicit, using our special vocabulary. Apart from reducing context size due to fixed-size embeddings, this effectively prevents intermediate hallucinations: While RECITE might completely invent facts, AutoRAG can at worst associate unrelated facts. Therefore, recitations would have to be checked for both correctness and relatedness, while for AutoRAG correctness is always given (under the assumption of a correct KB) and relatedness can, to some extent, be checked automatically using the relation structure of the KB (assuming a center of interest can correctly be identified based on the association weights).

Since our training aims to compress concept and relation descriptions by explaining information from the prefix, it is questionable whether plain next-token prediction is an appropriate pre-training objective: Even though we show that we are able to fine-tune such an LM effectively, training and final performance might benefit even more from de-noising auto-encoding pre-training tasks. Further, the choice of independent binary concept classification is not obvious at all: One could as well use auto-regressive concept sequence generation; however, effective training would then require a meaningful concept order.


Limitations: The purpose of the implementation in this work is to demonstrate our approach rather than to quantitatively assess the superiority of AutoRAG over classical RAG. The results do show slight improvements in both language modeling and classification, which to some extent is indeed expected, due to the decreased overall context length and the exploitation of secondary embedding training for classification, respectively. However, these improvements were not expected to show a noticeable impact in our implementation, due to the maximum context length of 1024 and a knowledge base that translates into a rather small retrieval database. A proper quantitative comparison should ensure consistent training, avoiding possible disturbance from different batch sizes or numbers of training steps. Further, each setup should be trained multiple times, using different randomly composed training sets, to compare average performance.

The reduction of hallucinations has not been directly evaluated either. Apart from fair training, this would require extensive manual evaluation, which was out of the scope of this work. As a matter of fact, our AutoRAG implementation does output wrong statements, which might be caused by erroneous training samples; at least one misleading instance was shown in Table 4. This could be improved by using a better trainer LLM or improved generation task prompts. But, as argued, even with perfect training data we could not exclude hallucinations in answer sampling or misclassification of concepts.

As a feasible evaluation tool, we propose graphical depictions of a KB which we consider trusted. Apart from providing multiple examples to eventually convince the reader, we did not further discuss whether this tool is actually helpful. Properly answering this question would require sufficient feedback from human evaluation, which, again, was out of the scope of this work.

Although the knowledge base used in our implementation is modeled as an MDP, special semantics, such as possible implications of certain failure occurrences, are not directly covered by the verbalizations. These semantic details are intended to make up rich training data in order to convey some intuition about model-implied reasoning. But since our training data was generated independently of the semantics, possible benefits from semantically rich KBs could not be evaluated here.

8 Conclusion

We propose AutoRAG, a framework for training an LM to work tightly with a pre-defined KB, outputting a textual answer combined with a symbolic representation. With our implementation of an example use case, compared to a RAG alternative, we qualitatively show improvements in terms of required context length, implying more efficient training and inference, as well as improved context understanding due to the shorter context.

Whether these improvements actually lead to a reduction in hallucinations could not be answered by our experiments. However, we believe that while hallucinations and their origins in general remain an interesting topic to illuminate further, we need solutions to validate generated system output, for which AutoRAG indicates a direction.

Finally, our implementation, especially the generation of training data, does not yet actively use the semantics of the formally modeled basis of our KB. We simply use descriptions as verbalizations and generate samples without focusing on any specific target task or content. Instead of sampling basic concepts, one could sample a specific state of the MDP, e.g., evidence for covered random variables specifying a particular failure state, and algorithmically compute optimal solutions or promising partial plans according to the semantics of the used MCEA model. Thus, we could not only train to verbalize given concepts but also utilize insights from running computationally complex inference algorithms to output helpful actions that should be conducted to remedy the failure state. We leave the active use of modeled semantics to generate meaningful and task-oriented training data for future work.