1. Introduction
The benefits of synthetic chemicals in daily life are undeniable; however, their intentional and unintentional release into the environment poses a significant risk to human health [1,2]. To date, millions of chemical substances have been identified [3], and health risk assessment is a crucial foundation for their regulation [4,5]. Evaluating the risks of chemicals requires not only an understanding of their environmental and biological behaviors, but also knowledge of their toxicity and various other factors [6,7]. In recent years, the rapid growth of research output has produced an influx of new information, and this exponential increase places a substantial burden on the managers and professionals who must keep their knowledge up to date. Consequently, the demand for effective knowledge retrieval has increased sharply.
In knowledge-intensive tasks, knowledge retrieval plays a crucial role [8]: it involves accurately locating information relevant to specific questions within vast knowledge bases. With continuous advances in artificial intelligence (AI), particularly in large language models (LLMs), knowledge retrieval is being applied ever more widely to vertical-domain question-answering (Q&A) tasks [9,10]. The essence of a Q&A task is to extract information from extensive text resources and generate accurate, relevant responses.
Despite significant progress in the field of LLMs, their application still faces several challenges. First, the textual knowledge that LLMs acquire through a large number of fixed parameters not only incurs high training costs, but is also difficult to update with the latest knowledge from the external world [11], making it hard for these models to adapt to new information over time. In addition, LLMs face credibility issues such as hallucinations and factual inaccuracies [12]. In particular, hallucination refers to the phenomenon in which LLMs generate factually incorrect or nonsensical outputs. Such unreliable outputs pose significant risks when LLMs are deployed in real-world applications, and existing research indicates that LLM-generated content is often unreliable [13].
To address the challenges mentioned above, researchers have proposed Retrieval-Augmented Generation (RAG), a new paradigm that enhances LLMs by integrating external knowledge sources [14]. To illustrate how the RAG technique can be applied to LLMs for developing a Q&A system related to human health risks, this manuscript is organized as follows: Section 2 introduces the related work. Based on the summary in Section 2, Section 3 presents the research gap, aims, and objectives. Section 4 details the materials and methods, while Section 5 discusses the results. Finally, Section 6 summarizes the conclusions.
2. Related Work
RAG employs a collaborative methodology that combines information retrieval mechanisms with the contextual learning capabilities of LLMs, pairing fixed-parameter LLMs with non-fixed-parameter data storage (such as text blocks in a knowledge base). In this paradigm, user queries are first matched against an external knowledge base, and search algorithms retrieve the relevant documents [15]. These documents are then incorporated into the LLM's prompts, providing additional context for generating responses. A key advantage of RAG is that it removes the need to retrain the LLM for specific tasks; developers can improve the accuracy of model outputs simply by augmenting the external knowledge base. The RAG approach has been shown to enable effective contextual learning from retrieved documents, significantly reducing the risk of generating hallucinated content [16].
With the rise of models like ChatGPT, RAG technology has developed rapidly. Recently, a series of studies have developed domain-specific question-answering systems that integrate specialized knowledge bases, and these systems have significantly improved their ability to handle interdisciplinary issues through modular design. For instance, Liu et al. [17] addressed the exponential growth of logical-form candidates through linearly growing primitives and comparative ranking methods, thereby achieving efficient, composable, and zero-shot question answering over knowledge bases and databases. Additionally, RnG-KBQA tackles coverage challenges and enhances generalization capability through comparative ranking of candidate logical forms and a generative model conditioned on the question and the top-ranked candidates [17]. These advancements indicate that RAG technology has immense potential in specialized question-answering systems, effectively tackling complex knowledge-intensive tasks.
To date, LLMs have been applied across a wide range of research fields. For instance, prompt engineering has guided ChatGPT to automatically extract synthesis conditions for metal–organic frameworks from the scientific literature [18]. In the medical question-answering domain, BiomedRAG integrates a retrieval-augmented model with the biomedical field, directly feeding retrieved text blocks into the LLM, which enables it to perform exceptionally well on various biomedical NLP tasks [19]. In the legal domain, Louis et al. [20] proposed an end-to-end approach that employs a retrieve-and-read process to provide comprehensive answers to any legal question. In open question answering, PaperQA combines retrieval augmentation and AI agents to address questions about the scientific literature, demonstrating superior performance on current scientific question-answering benchmarks compared with existing LLMs [21].
In the field of AI for environmental science, intelligent assistants based on LLMs are transforming traditional research processes. Zhu et al. [22] noted that ChatGPT's popularity stems from its ability to provide quick, informative, and seemingly "intelligent" answers to a wide variety of questions. The authors summarized several beneficial areas, including writing improvement, key point and theme identification, sequential information retrieval, as well as coding, debugging, and syntax explanation. However, they also cautioned researchers about potential issues such as the generation of fabricated information, the lack of updated domain knowledge, insufficient accountability in decision-making, and the opportunity cost associated with relying on ChatGPT.
Furthermore, LLM-driven systems can accelerate research through the autonomous execution of tasks, showing high efficiency particularly in the construction of adverse outcome pathways (AOPs) [23]. These systems can quickly extract key information from the literature, build causal networks, align closely with expert-validated findings, and provide more in-depth insights. While limitations remain, ongoing advances in AI technology and collaboration between AI systems and human experts show promise for the future of AOP construction. By harnessing the strengths of LLMs, we can improve our understanding of the adverse effects of environmental pollution and better protect public health through more effective risk assessment and regulatory decision-making.
Xu et al. [24] summarized the use of generative artificial intelligence in environmental science and engineering. In particular, the authors proposed applications such as designing new treatment processes, developing environmental models, and evaluating environmental policies. They also noted significant challenges, including obtaining and creating specialized datasets prior to model construction and ensuring the accuracy of outputs throughout the model development and usage phases.
A recent case highlights the role of LLMs and Q&A systems in revolutionizing water resource management, research, and policymaking [25]. After posing several questions to ChatGPT, the author concluded that integrating AI, particularly deep learning and advanced language models like ChatGPT, offers transformative opportunities in this field, with key points including enhanced understanding, democratization of knowledge, support for decision-making, sustainability, and vast potential.
However, most studies have only pointed out that LLMs could be widely applied and useful in environmental science, and few practical systems have been established. One case [26] assessed two generative pretrained transformer (GPT) models and five fine-tuned models (FTMs) using a specialized question-answering dataset, focusing on relevance, factuality, format, richness, difficulty, and domain topics. The results revealed that GPT-4 scored 0.644 in relevance and 0.791 in factuality across 286 questions, with scores dropping below 0.5 for more challenging questions, indicating a need for improvement. In contrast, FTMs trained on larger datasets maintained factual accuracy, emphasizing the importance of high-quality training materials. The study highlighted inaccuracies and format problems tied to overtraining and catastrophic interference, and used expert-level textbooks to enhance LLM performance, paving the way for more robust domain-specific LLMs in environmental applications.
Saeid et al. [27] enhanced GPT-4 by integrating access to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC AR6). The conversational AI prototype, accessible at www.chatclimate.ai, is designed to tackle challenging questions through three distinct configurations: GPT-4, ChatClimate, and Hybrid ChatClimate. Expert evaluations of the responses generated by these models indicate that the Hybrid ChatClimate AI assistant provides significantly more accurate answers.
Ren et al. [28] trained an LLM to act as a hydrology expert, termed WaterGPT, which is applied in three primary domains: data processing and analysis, intelligent decision-making support, and interdisciplinary information integration. The model has demonstrated promising results, particularly through its careful segmentation of training data during the supervised fine-tuning phase. These data are derived from real-world sources and annotated with high precision, using both manual techniques and annotations from GPT-series models. The data are categorized into four distinct types: knowledge-based, task-oriented, negative samples, and multi-turn dialogues.
Liang et al. [29] developed a framework utilizing GPT-based text mining to extract information related to oxidative stress tests. This framework encompasses several key components: data collection, text preprocessing, prompt engineering, and performance evaluation. The authors extracted a total of 17,780 relevant records from 7166 articles, covering 2558 unique compounds. Interestingly, over the past two decades there has been a noticeable increase in interest in oxidative stress. This research led to the establishment of a comprehensive list of known prooxidants (n = 1416) and antioxidants (n = 1102); the primary chemical categories for prooxidants were pharmaceuticals, pesticides, and metals, while pharmaceuticals and flavonoids were predominant among antioxidants.
Recently, scholars from Peking University developed a web app called Water Scholar (https://www.waterscholar.com/ (accessed on 5 January 2025)). This project is a free research assistant application for water science, based on the Wenxin large model. The app offers several features, including the ability to search the water-related literature, generate literature reviews, answer professional knowledge questions, and create citation lists.
4. Methods and Materials
4.1. Study Framework
To develop a knowledge-based LLM system using retrieval-augmented generation, this study was divided into three parts (Figure 1). (1) Generation of knowledge question–answer pairs: this step involved collecting and systematically organizing literature in the field of human health risk assessment to extract key information and form question–answer pairs. These pairs not only provide a benchmark for the large language model, but also help reveal its limitations in this domain. (2) Development of a Q&A system integrating multiple knowledge base retrievals: this development was based on a thorough analysis of existing knowledge retrieval technologies (as stated in the Introduction) to ensure precise and comprehensive answers to relevant questions. The system helps answer specific questions posed by non-experts efficiently, significantly reducing the time they spend re-learning and retrieving information in interdisciplinary research. (3) Performance evaluation: comparing the accuracy of different strategies in answering scientific questions provides a scientific basis for further optimization and improvement of the Q&A system.
4.2. The Generation of Question–Answer Pairs
The generation of question–answer pairs consisted of three steps: literature retrieval, information extraction, and question–answer pair generation. Literature retrieval involved keyword searches on the PubMed and Scopus platforms, categorizing the research field of human health risk assessment into six submodules: analytical method, transport and fate, environment exposure, toxicokinetics, toxicity, and human health risk.
To ensure the scientific rigor and comprehensiveness of the literature retrieval, we referenced both classic literature and the latest research findings in the relevant field when selecting keywords. We conducted a precise screening based on the core themes and characteristics of the subfields of human health risk assessment, ensuring coverage of the key concepts and research directions in the field. Different keywords were used for the searches; the keyword table is presented in Supplementary Material Table S1, and the categorization method and keyword selection criteria are detailed in Supplementary Material Text S1. Subsequently, both manual downloads and automated scripts were employed to obtain PDF documents from PubMed and Scopus. Through the PubMed API and web scraping, metadata including titles, abstracts, and publication years were gathered as the data foundation for this study [30]. It is important to note that our research does not exhaust all literature. Here, we attempted to use sufficient documents to build a knowledge vector database, which in turn supports the question-answering system; exhausting all literature in the field would place a considerable computational burden on the server. Therefore, we set a literature cap of 500 for the analytical method field and 200 for the other fields.
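To make the metadata collection step concrete, the following is a minimal sketch of how titles and publication years could be gathered through the NCBI E-utilities API. The query string, the per-field cap, and the selected fields are illustrative rather than the exact pipeline used in this study, and abstracts would require an additional efetch call.

```python
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def search_pubmed(query: str, cap: int = 200) -> list[str]:
    """Return up to `cap` PubMed IDs matching a keyword query."""
    resp = requests.get(f"{EUTILS}/esearch.fcgi", params={
        "db": "pubmed", "term": query, "retmax": cap, "retmode": "json",
    })
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]

def fetch_summaries(pmids: list[str]) -> list[dict]:
    """Fetch title and publication year for each PMID via esummary."""
    resp = requests.get(f"{EUTILS}/esummary.fcgi", params={
        "db": "pubmed", "id": ",".join(pmids), "retmode": "json",
    })
    resp.raise_for_status()
    result = resp.json()["result"]
    return [
        {"pmid": pid,
         "title": result[pid].get("title", ""),
         "year": result[pid].get("pubdate", "")[:4]}
        for pid in result.get("uids", [])
    ]

if __name__ == "__main__":
    # Illustrative query; the real keyword sets are listed in Table S1.
    pmids = search_pubmed("toxicokinetics AND human health risk", cap=200)
    for record in fetch_summaries(pmids[:5]):
        print(record)
```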
To reduce the hallucination issues of LLMs across domains, this study designed an automated question–answer pair generation process, significantly improving efficiency compared with traditional manual annotation. As shown in Figure 2, this process consists of three main steps. First, leveraging the LLM's contextual learning ability and appropriate prompt engineering, each document was input into the model after being transformed into a literature vector index; the aim of this step is to generate three candidate questions. Next, for each question, a multi-index retrieval-based LLM generated answers from the corresponding literature, annotating the source to ensure accuracy and verifiability. Finally, the system evaluated the three candidates and selected the highest-quality question–answer pair. Following the PubMedQA approach [31,32], the question–answer pairs were saved in a structure that includes the question, the answer content, the source, and the literature DOI. This automated process not only enhances the efficiency of question–answer pair generation, but also ensures the quality and practicality of the information through systematic screening.
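A minimal sketch of how one stored entry might look, using the fields named in this section and in Section 5.1 (question, answer, source context, DOI, publication time); the concrete values are placeholders, and the example question is the prompt-engineered question shown later in Figure S4.

```python
# One generated question–answer pair as stored in the benchmark dataset;
# field names follow the structure described in the text, values are placeholders.
qa_pair = {
    "question": "What role do phosphate transporters play in PFOS sensing in plants?",
    "answer": "…answer generated by the retrieval-augmented LLM…",
    "source_context": "…the retrieved passage(s) the answer was grounded in…",
    "DOI": "10.xxxx/placeholder",
    "publication_time": "2023",
}
```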
4.3. Naive Retrieval-Augmented Generation-Based Question-Answering System
Naive Retrieval-Augmented Generation (Naive RAG) is one of the earliest RAG methods [33], employing a traditional "retrieve-read" framework: data are first indexed, retrieval is then performed based on the user query, and finally the retrieved information is used as context to generate a response. This simple yet representative retrieval-augmented structure makes Naive RAG widely used for comparative evaluations against more complex retrieval-augmented techniques; as a baseline framework, it provides a unified reference standard for assessing the improvements of new methods in both retrieval performance and generation quality. Its main drawbacks include low retrieval quality, limited quality of generated responses, and potential loss of context when integrating retrieved information.
As shown in Figure 3, the construction of a Q&A system based on RAG is divided into two parts: building a vector knowledge base and implementing the Q&A process. The vector knowledge base construction involved document segmentation and vectorization of the resulting chunks. Since the collected literature was in PDF format, which cannot be read directly as plain text, we used the PyPDF library within the LangChain framework to convert PDF documents into strings [34]. PyPDF is a widely used Python library for processing PDF files, capable of reading, splitting, merging, cropping, and converting PDF pages, as well as extracting text, images, and metadata. After extracting the text, we used LangChain's fixed-length text splitting method to segment the literature into blocks of 1000 characters each. Once the documents were chunked, the resulting sub-documents were vectorized, a step that transforms text into high-dimensional vectors and was completed with OpenAI's text-embedding-3-large model, which captures semantic information and represents it as fixed-length vectors.
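The following sketch illustrates this construction pipeline under the assumption of a recent LangChain package layout (module paths vary across versions). The splitter shown is LangChain's RecursiveCharacterTextSplitter, which Section 5.3 names as the splitting method; FAISS is used as an illustrative vector store since the text does not name one, and the file names are placeholders.

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# 1. Convert each PDF into text documents (one per page) via PyPDF.
docs = []
for path in ["paper_001.pdf", "paper_002.pdf"]:  # illustrative file names
    docs.extend(PyPDFLoader(path).load())

# 2. Split the extracted text into 1000-character chunks, as described above.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
chunks = splitter.split_documents(docs)

# 3. Vectorize the chunks with OpenAI's text-embedding-3-large model and index them.
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = FAISS.from_documents(chunks, embeddings)
vectorstore.save_local("health_risk_kb")  # persist the knowledge base on disk
```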
In the Q&A process, user questions were vectorized with the same embedding model to ensure consistent vector processing. The distance between vectors represents the semantic similarity of two text segments; based on this vector-matching principle, we computed the cosine similarity between vectors to find the documents in the knowledge base that are semantically closest to the user's question. These documents were used as the context for the question and were input together with the question into the prompt template, leveraging the context-learning capability of the LLM to improve answering effectiveness.
This process can be represented mathematically as follows: given a user question $q$ and a set of document contents $D = \{d_1, d_2, \ldots, d_n\}$, using the embedding model $E(\cdot)$ and the cosine matching algorithm $\mathrm{sim}(\cdot,\cdot)$, the top $K$ retained document contents $context_K$ are

$$context_K(q) = \underset{d_i \in D}{\operatorname{top\text{-}K}}\ \mathrm{sim}\big(E(q), E(d_i)\big).$$

Then, the answer is generated by the LLM:

$$A = \mathrm{LLM}\big(q, context_K(q)\big).$$
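Continuing the assumptions of the previous sketch (a persisted FAISS index named health_risk_kb), the retrieve-and-generate step of Naive RAG could look as follows; the value of K, the prompt wording, and the choice of gpt-3.5-turbo (one of the models evaluated in Section 5.2) are illustrative.

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = FAISS.load_local("health_risk_kb", embeddings,
                               allow_dangerous_deserialization=True)

def naive_rag_answer(question: str, k: int = 4) -> str:
    # Retrieve the K chunks whose embeddings are closest to the question vector.
    hits = vectorstore.similarity_search(question, k=k)
    context = "\n\n".join(doc.page_content for doc in hits)
    # Insert the question and retrieved context into a simple prompt template.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
    return llm.invoke(prompt).content
```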
4.4. Advanced Retrieval-Augmented Generation Question-Answering System
To overcome the drawbacks of Naive RAG, the Advanced RAG presented in this study introduces more sophisticated techniques such as query rewriting, document reordering, and prompt summarization, aimed at improving retrieval relevance and the quality of the generated text [35]. In summary, Advanced RAG optimizes data indexing through pre-retrieval and post-retrieval strategies and enhances the quality of the retrieval process via techniques such as fine-grained segmentation and reordering.
Semantic vector matching often fails because the semantic relationship between a question and the document content is sometimes unclear. For example, a study focusing on per- and polyfluoroalkyl substances (PFASs) may suggest that altering agricultural practices can reduce PFASs' environmental impact, while the question may specifically concern PFASs' effects on water quality. In such cases, direct semantic matching may not retrieve the most relevant information, and deeper understanding and analysis are needed. Additionally, we assumed that solving certain questions requires information from the full text of the literature; however, current technology frequently struggles to parse PDF-formatted documents fully and accurately, introducing noise that may deviate from the original text and adversely affect answer generation.
To tackle these issues, this research designs an advanced retrieval-augmented framework (termed Advanced RAG) that adds two modules, dual-layer retrieval and clue extraction, to Naive RAG. Before retrieving chunked documents as in Naive RAG, we incorporate vector matching between questions and document summaries. To mitigate the impact of noise on question-answering effectiveness, we added an LLM-based information extraction module that gathers question-relevant clues from the retrieved chunks. The framework (Figure 4) mainly consists of four processes: paper search, chunk search, gathering evidence, and answering the question based on the evidence.
Paper search: The goal of this step is to identify the literature most relevant to the user's question. First, we vectorized the question $q$ using the embedding model $E(\cdot)$, which captures the semantic features of the question; for this, we used the text-embedding-3-large model provided by OpenAI. Then, we matched this vector against the literature abstracts $\mathcal{A} = \{a_1, a_2, \ldots, a_m\}$ in a pre-constructed vector database using cosine similarity $\mathrm{sim}(\cdot,\cdot)$, resulting in a collection of literature references $R$. This can be expressed mathematically as follows:

$$R = \underset{a_i \in \mathcal{A}}{\operatorname{top\text{-}K}}\ \mathrm{sim}\big(E(q), E(a_i)\big).$$
Chunk search: In this step, we aimed to retrieve the most relevant content segments from the literature collection obtained in the previous step. As in the paper search, we used vector matching, but at a finer granularity, operating on the document paragraphs $P = \{p_1, p_2, \ldots, p_l\}$. Using the literature collection $R$ gathered in the previous step as a filtering criterion, we searched for the content most relevant to the question in a vector database constructed from the literature content. Here we employed maximum marginal relevance search $\mathrm{MMR}(\cdot)$ for vector matching, which ensures that the retrieved results are not only highly relevant to the question but also diverse from one another, thereby enhancing both the diversity and the relevance of the retrieved content. This can be expressed mathematically as follows:

$$C = \mathrm{MMR}\big(E(q),\ \{E(p_j) \mid p_j \in P,\ \mathrm{source}(p_j) \in R\},\ K\big).$$
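A sketch of the dual-layer retrieval, assuming two LangChain vector stores: one over abstracts (paper search) and one over 1000-character chunks whose metadata record the source DOI (chunk search). The DOI metadata field, the k values, and the callable filter argument are assumptions; filter support varies across vector stores.

```python
def paper_search(abstract_store, question: str, k: int = 5) -> set[str]:
    """Layer 1: cosine-match the question against abstracts, return candidate DOIs."""
    hits = abstract_store.similarity_search(question, k=k)
    return {doc.metadata["doi"] for doc in hits}

def chunk_search(chunk_store, question: str, dois: set[str], k: int = 10):
    """Layer 2: maximum-marginal-relevance search over chunks from the candidate papers."""
    return chunk_store.max_marginal_relevance_search(
        question,
        k=k,            # number of diverse, relevant chunks to keep
        fetch_k=50,     # candidate pool considered before MMR re-ranking
        filter=lambda meta: meta.get("doi") in dois,  # restrict to layer-1 papers
    )
```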
Gather evidence: In this step, we gathered the evidence relevant to the question from the retrieved chunks through targeted prompt engineering. First, this step minimizes irrelevant noise, including parsing errors that may occur when reading PDF documents, allowing for a more streamlined question-answering process and providing the option to reject segments entirely. Second, multiple segments can be processed independently and simultaneously, saving processing time. Each piece of evidence is represented by the following equation:

$$e_j = \mathrm{LLM}_{\mathrm{extract}}\big(q, c_j\big), \quad c_j \in C,$$

where $\mathrm{LLM}_{\mathrm{extract}}$ denotes the LLM-based clue extraction applied to each retrieved chunk $c_j$.
Answer question based on evidence: Finally, the previously collected relevant information was combined into a specific prompt template and provided to the LLM. The prompt includes elements of a reasoning chain, guiding the LLM to infer step-by-step to generate an answer. The LLM synthesizes these clues to produce a coherent and logical response or, in cases of insufficient clues, chooses to refuse to answer, thereby avoiding incorrect or misleading answers. This step ensures that the final answer is accurate and evidence-based, enhancing the reliability and quality of the question-answering system while providing an option to decline when necessary.
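An illustrative prompt template for this final step, showing the step-by-step reasoning guidance and the explicit refusal option described above; the wording is an assumption, not the exact prompt used in the study.

```python
# Illustrative prompt template; `clues` would hold the evidence snippets gathered
# in the previous step, each tagged with its source. Wording is a placeholder.
ANSWER_PROMPT = """You are answering a question about human health risk assessment.

Evidence (each item cites its source):
{clues}

Question: {question}

Instructions:
1. Reason step by step, using only the evidence above.
2. Cite the source of every fact you rely on.
3. If the evidence is insufficient, reply "I cannot answer based on the retrieved
   literature." instead of guessing.

Answer:"""

def build_answer_prompt(clues: list[str], question: str) -> str:
    return ANSWER_PROMPT.format(clues="\n".join(f"- {c}" for c in clues),
                                question=question)
```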
4.5. The Evaluation of the Question-Answering System
In this study, we utilized the following indices to evaluate performance in each research subfield of human health risk assessment: correctness, answer relevance, faithfulness, and context relevance [36].
Correctness. The correctness of an answer involves two aspects: factual accuracy and the semantic similarity between the answer and the ground truth. These two aspects were combined through a weighted approach to obtain the final correctness score.
For factual accuracy ($F_c$), the LLM was used to split the generated answer ($A$) and the reference answer ($RA$) into multiple simpler statements. This step is defined as $S(\cdot)$, yielding two sets:

$$S_A = S(A), \quad S_{RA} = S(RA).$$

The factual accuracy $F_c$ quantifies the factual overlap between the generated answer and the reference answer as an F1-style score:

$$F_c = \frac{|TP|}{|TP| + \tfrac{1}{2}\left(|FP| + |FN|\right)},$$
where true positives ($TP$) are the statements present in both the generated answer and the reference answer, false positives ($FP$) are statements present in the generated answer but not in the reference answer, and false negatives ($FN$) are statements present in the reference answer but not in the generated answer.
In addition, the answer semantic similarity ($A_{ss}$) evaluates the semantic similarity between the generated answer and the reference answer, with values ranging from 0 to 1; a higher score indicates greater consistency between the answers and provides valuable insight into the quality of the generated response. In this study, we used OpenAI's text-embedding-3-large model to vectorize the text and then computed the cosine similarity between the semantic vectors.
Finally, by combining the factual accuracy and the answer semantic similarity with weighting factors, we obtained the overall correctness of the answer:

$$Correctness = w_1 \cdot F_c + w_2 \cdot A_{ss},$$

where $w_1$ and $w_2$ are the weights. In this study, we set $w_1 = 0.75$ and $w_2 = 0.25$.
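A small sketch of the correctness computation, assuming the factual overlap is scored as an F1-style ratio over the TP, FP, and FN counts defined above and combined with the semantic similarity using the stated weights.

```python
def factual_accuracy(tp: int, fp: int, fn: int) -> float:
    """F1-style overlap between generated and reference statements (assumed form)."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else 0.0

def answer_correctness(tp: int, fp: int, fn: int, semantic_similarity: float,
                       w1: float = 0.75, w2: float = 0.25) -> float:
    """Weighted combination of factual accuracy and semantic similarity."""
    return w1 * factual_accuracy(tp, fp, fn) + w2 * semantic_similarity

# Example: 6 shared facts, 1 extra statement, 2 missing facts, similarity 0.88
score = answer_correctness(tp=6, fp=1, fn=2, semantic_similarity=0.88)
print(round(score, 3))  # 0.75 * 0.8 + 0.25 * 0.88 = 0.82
```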
Answer relevance. Answer relevance ($AR$) assesses how relevant the generated answer is to the question posed: answers that are incomplete or contain redundant information receive lower scores, while higher scores indicate better relevance. In our study, given the generated answer $A$, a set of questions related to $A$ was first generated using a language model, $Q_A = \{q_1, q_2, \ldots, q_n\}$, where each sub-question $q_i$ is directly related to the answer. The relevance of the answer to the question was then calculated as the average semantic similarity between each sub-question and the original question, using the same method for computing semantic similarity as described earlier.
Faithfulness. Faithfulness measures the factual consistency between the generated answer and the given context; it is determined from the generated answer ($A$) and the provided context $context(q)$. If all of the statements made in the answer can be inferred from the given context, the generated answer is considered faithful. To calculate this, two calls to the LLM were made. The first call split the answer into a set of statements, denoted by the function $S(\cdot)$, with the statement set represented as $S = \{s_1, s_2, \ldots, s_n\}$. The second call determined whether each individual statement could be inferred from the context, denoted by the function $V(s_i, context(q))$; the set of all inferable statements is $V = \{s_i \in S \mid V(s_i, context(q)) \text{ is true}\}$. Finally, the faithfulness of the answer is the proportion of the number of elements in set $V$ to the total number of elements in set $S$.
Context relevance. Ideally, the retrieved context should contain only the information essential for addressing the query. Given a question $q$ and its associated context $context(q)$, this study determined context relevance ($CR$) by evaluating the proportion of critical information within the context. First, an LLM was used to extract from the context a set of sentences $S$ that were crucial to the question. Then, the proportion of $S$ within the context was calculated as follows:

$$CR = \frac{\text{number of sentences in } S}{\text{total number of sentences in } context(q)}.$$
This metric was used to evaluate the quality of the context obtained from different retrieval methods, with values ranging from 0 to 1, where a higher value indicated better retrieval quality.
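Both faithfulness and context relevance reduce to simple proportions once the LLM has produced its statement and sentence judgments; the sketch below assumes those judgments are supplied as callables standing in for the LLM calls described above.

```python
from typing import Callable

def faithfulness(answer_statements: list[str],
                 supported: Callable[[str], bool]) -> float:
    """Share of answer statements inferable from the retrieved context.
    `supported` stands in for the LLM call V(s_i, context(q))."""
    if not answer_statements:
        return 0.0
    verified = [s for s in answer_statements if supported(s)]
    return len(verified) / len(answer_statements)

def context_relevance(context_sentences: list[str],
                      crucial: Callable[[str], bool]) -> float:
    """Share of context sentences judged essential to the question.
    `crucial` stands in for the LLM-based sentence extraction step."""
    if not context_sentences:
        return 0.0
    essential = [s for s in context_sentences if crucial(s)]
    return len(essential) / len(context_sentences)
```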
5. Results and Discussion
5.1. The Evaluation of Question–Answer Pair Generation
Based on the process described in Supplementary Material Text S1 and the keywords provided in Table S1, this study collected a total of 1500 articles. The number of articles in the dataset varied with publication time, as shown in Supplementary Material Figure S1, indicating a rapid increase from 2000 to 2020; in particular, the number of articles published during 2015–2019 increased by approximately 167% compared with 2010–2014, while the number from 2020 to 2024 remained on par with that of 2015–2019.
In the question–answer pair generation process, the basic procedure involves converting topics into questions and then using the research content to provide answers. However, some topics are not suitable for conversion into questions (in fact, only about 65–75% of the literature is appropriate for generating Q&A pairs), leading to a mismatch between the number of generated question–answer pairs and the quantity of literature. As shown in Supplementary Material Figure S2, the number of question–answer pairs varied with publication time, following a trend similar to that of the number of publications plotted in Supplementary Material Figure S1. To ensure the balance of the dataset, we selected 50 high-quality question–answer pairs from each field, yielding a test dataset of 300 pairs. Here, we present a pair from a study on the biodegradation of phthalic acid esters (PAEs) in Supplementary Material Figure S3, with another three examples available in Supplementary Material Table S2. Each entry in the database is stored as a dictionary containing the question, answer, source_context, DOI, and publication time.
This study demonstrated that prompt templates can significantly enhance the quality of generated question–answer pairs. As illustrated in Supplementary Material Figure S4, the case presented was derived from a study on the impact of perfluorooctanesulfonic acid (PFOS) on plant phosphate transporter gene networks; the figure compares the question–answer pairs before and after the application of prompt engineering. When only basic prompts were used (consisting solely of a simple task description), the questions exhibited some ambiguity, and the explanations of the mechanisms in the answers were incomplete. For instance, the original question generated was: "What was the focus of the study mentioned in the text regarding perfluorooctanesulfonic acid (PFOS) and plants?" [37]. The reference to "the study" introduced semantic vagueness, as readers might not know which specific research was being referred to. After optimization through prompt engineering, the question was restructured to: "What role do phosphate transporters play in PFOS sensing in plants?". This prompt engineering significantly improved the precision of the question and the relevance of the answer.
In this section, we designed and implemented an innovative automated question–answer pair generation process. Compared to manual annotation, our approach significantly improved the efficiency and quality of question–answer pair generation. Additionally, by optimizing prompts, we enhanced both the precision of question formulations and the relevance and comprehensiveness of the answers. Ultimately, this study generated 300 high-quality standard question–answer pairs, which will serve as a benchmark dataset for evaluating the performance of Q&A systems.
5.2. The Performance Evaluation
Based on the 300 high-quality question–answer pairs, we conducted performance testing on the Naive RAG question-answering system, on the system integrated with advanced retrieval techniques (including dual-layer retrieval, RAG-Fusion, and Step-back), and on four commonly used LLMs (gpt-3.5-turbo and gpt-4: https://platform.openai.com/docs/models/ (accessed on 12 August 2024); glm-3-turbo and glm-4: https://github.com/THUDM/ChatGLM3/blob/main/README_en.md (accessed on 12 August 2024)) in the individual research subfields of human health risk assessment. The accuracy of the models' responses is shown in Table 1. The results indicate that the advanced retrieval-enhanced Q&A system performed best across five research subfields, with accuracy ranging from 0.606 to 0.723, surpassing all four foundational large language models, including the largest model, GPT-4. Only in the health risk assessment domain was the Advanced RAG system's accuracy (0.583) slightly below that of the Naive RAG system (0.599).
These findings highlight the superiority of the advanced retrieval-enhanced Q&A system in handling complex Q&A tasks. The system effectively integrates information from various sources through the dual-layer retrieval mechanism, while RAG-Fusion further optimizes the information merging process, and the Step-back mechanism allows for necessary backtracking during answer generation to ensure accuracy and comprehensiveness. The synergy of these techniques significantly improves the system’s answer correctness across multiple domains.
It is also worth noting that in specific domains, such as toxicity, gpt-3.5-turbo (with an accuracy of 0.509) outperformed its upgraded version, gpt-4 (accuracy of 0.492). A similar situation was observed between glm-3-turbo and glm-4. This phenomenon may be related to the introduction of false positives during the evaluation process. The presence of false positives can negatively impact the assessment of longer responses, as they may be incorrectly deemed irrelevant or incorrect despite being accurate in content. Additionally, we observed that answers generated by the GLM-4 model are often more verbose, which may affect the accuracy of the evaluation results in certain cases.
Compared with the Naive RAG system, the Advanced RAG system generally demonstrated a significant advantage in accuracy. This suggests that information extracted from abstracts is more beneficial than that from the full text, since abstracts generally contain less noise and can be obtained directly, whereas full-text extraction from PDF documents can lead to context loss. Additionally, the dual-layer retrieval mechanism allows for more precise extraction from the vector database, minimizing noise. Together, these mechanisms enabled our Advanced RAG system to achieve the highest accuracy in the question–answer pair tests.
Additionally, we evaluated performance in terms of answer relevance. As shown in Supplementary Material Table S3, most models performed well, with scores generally close to or exceeding 0.9. The Advanced RAG Q&A system achieved relevance scores close to or higher than those of the other models across the six domains, indicating that the generated answers were highly relevant to the questions, with concise content and minimal redundancy. Although answer relevance does not directly correlate with accuracy, highly relevant answers typically contain more useful information, reflecting the model's ability to respond effectively to user queries.
We also present the faithfulness and context relevance of both the Naive RAG and Advanced RAG systems (see Table 2). Both systems demonstrated very high faithfulness (over 90%), indicating that the answers provided by the models are mostly derived from the retrieved content rather than hallucinated, thus ensuring factual consistency with the knowledge base. On the other hand, the context relevance of the Naive RAG system was generally higher than that of the Advanced RAG system.
In summary, based on the performance of different Q&A systems on these 300 question–answer pairs, we found that the Advanced RAG system achieved the highest accuracy, followed by the Naive RAG system, both outperforming large language models. Additionally, both Advanced RAG and Naive RAG demonstrated excellent results in answer relevance, faithfulness, and context relevance. These testing results indicate that our Advanced RAG system, built on a large literature database, is well-suited to address relevant professional questions in the field of health risk assessment.
5.3. The Ablation Experiment for Advanced RAG
An ablation study refers to the process of removing, or "ablating", different parts of a model to evaluate the impact of each component on the model's performance [38]. Through ablation studies, we can gain insight into the internal mechanisms of the model and understand the importance and contribution of each component. As shown in Figure S5, the Advanced RAG Q&A system is mainly composed of four components: Retrieval 1 (summary retrieval), Retrieval 2 (content retrieval), Information Extraction (IE), and Answer Generation. There are also two auxiliary components, RAG-Fusion and Step-back, which are located within the summary retrieval and answer generation components, respectively.
As shown in Table 3, in the subfields of exposure, transport and fate, and human health risk, removal of the IE component led to the largest drop in accuracy, with decreases of 6.3%, 6.8%, and 6.7%, respectively, indicating the importance of the IE component for system performance. The decline in performance after removing IE may be due to the need for the system to recognize and convert PDF-formatted literature during loading, which introduces noise; without IE, this noise enters the LLM's input directly, preventing the LLM from identifying the information needed to answer questions. Additionally, when splitting PDF documents, we used the Recursive Character Text Splitter method provided by LangChain (https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/recursive_text_splitter/ (accessed on 23 July 2024)), which directly cuts the document into chunks of a specified size (set to 1000 in this study), potentially leading to the loss of some contextual information and formatting details.
In the fields of analytical methods and toxicity, the performance decline was greatest when main-content retrieval was removed, with decreases of 13.2% and 8.7%, respectively, compared with the full system. This indicates that retrieving the main text of the literature plays a positive role in answering questions in these fields. The absence of summary retrieval led to a decline in performance across all six areas, with the largest drop in the toxicokinetics field (7.9%), further underscoring the importance of abstract matching. In summary, the results of the ablation experiment clearly demonstrate the importance of the individual components of the Advanced RAG Q&A system.
5.4. The Design of a Multi-Knowledge Base Integrated Question-Answering System
This study has demonstrated that retrieval-augmented generation can improve the LLM's answering capability in knowledge-intensive tasks. It is important to note, however, that our study was conducted within specific subfields. In practical operation, or in the design of a knowledge Q&A system, the first challenge when a user poses a question is to identify the specific subfield and related subfields, because knowledge in a field should be subdivided into multiple sub-knowledge modules, and a comprehensive question-answering system should be able to retrieve and integrate appropriate information from these modules, synthesizing answers from multiple sources. Previous knowledge base Q&A systems, however, have typically embedded all documents into the same vector space for retrieval, whether once or multiple times. These methods treat all knowledge as a single entity when finding the most relevant content to assist the LLM, without considering the need for interdisciplinary knowledge, which limits the efficiency and accuracy of the Q&A system's responses.
Therefore, we built an advanced retrieval system for each of the six domains as submodules and integrated them into a comprehensive system (Figure 5). Given a user question, the knowledge from these modules is selectively activated to provide knowledge blocks relevant to the question. Because of the LLM's limits on context size and processing speed, it is not feasible to input unlimited information; thus, before the knowledge is passed to the LLM, a ranking technique is used to filter out the knowledge blocks most relevant to the question.
Intent recognition and task distribution: Upon receiving a user question, the Q&A system first invokes the LLM and asks, ‘Do you need additional information to solve this problem?’ to get a YES or NO response. If YES is chosen, the system then prompts the LLM to select the appropriate module from six knowledge modules to retrieve the needed information.
Retrieving knowledge modules: A total of six knowledge modules have been established, each consisting of an advanced retrieval system with dual-layer retrieval and information extraction; the specific configuration of the Advanced RAG system is given in Section 4.4. It is important to note that only the retrieval component of the Advanced RAG Q&A system is used here, meaning that each module outputs K knowledge blocks, which are derived from the literature content retrieved in each domain and processed through LLM-based information extraction. If each knowledge module outputs K pieces of information (with K set to 10), the LLM could receive up to 60 pieces of information during the final answering phase. Considering the LLM's context length limitations, ranking techniques are applied to filter the pooled knowledge blocks down to the top-k blocks, as sketched below.
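The study does not specify the ranking technique; the sketch below assumes a simple embedding-similarity re-ranking of the pooled knowledge blocks using the same text-embedding-3-large model.

```python
import numpy as np
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

def rank_knowledge_blocks(question: str, blocks: list[str], top_k: int = 10) -> list[str]:
    """Keep the top_k blocks most similar to the question (cosine similarity).
    Stands in for the unspecified ranking step that trims the pooled blocks."""
    q = np.array(embeddings.embed_query(question))
    B = np.array(embeddings.embed_documents(blocks))
    scores = B @ q / (np.linalg.norm(B, axis=1) * np.linalg.norm(q))
    order = np.argsort(scores)[::-1][:top_k]
    return [blocks[i] for i in order]
```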
Generating answers from integrated knowledge blocks: The information from the knowledge blocks would be incorporated into a carefully designed prompt that utilizes a chain-of-thought approach, guiding the LLM on how to utilize the knowledge block information to think through and gradually solve the user’s problem.
In practical applications, not all questions require multiple retrieval enhancements to obtain the correct answer. Actually, some questions can be directly answered by the LLM, while relevant information for others may not be retrievable from the literature database. Considering the costs associated with LLM calls and the time required for answering, the complete system design involves various processing branches beyond the process described in the previous section. This LLM system, also referred to as an Agent, can intelligently select branches and complete the branching processes.
Hence, to better meet practical needs, this study expands the aforementioned process (Figure 5) by adding further information-processing branches, as illustrated in Supplementary Material Figure S6. We used the LangGraph framework to develop the entire Q&A system. Specifically, this framework conceptualizes a Q&A system as a directed acyclic graph composed of multiple data-processing nodes and edges. Each step is abstracted into an independent node, and each invocation of the LLM or retrieval from the vector database takes place within these nodes. Directed edges connect the nodes, with the direction indicating the next step in data processing; some edges are conditional, allowing branching to different data-processing paths when the previous node returns specific values.
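A minimal LangGraph sketch of this node-and-conditional-edge structure; the state fields and node bodies are illustrative stubs, not the authors' implementation.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class QAState(TypedDict, total=False):
    question: str
    needs_retrieval: bool
    knowledge_blocks: list[str]
    answer: str

def intent_recognition(state: QAState) -> QAState:
    # Ask the LLM whether external knowledge is needed (stubbed here).
    return {"needs_retrieval": True}

def retrieve_modules(state: QAState) -> QAState:
    # Query the selected knowledge modules and rank the pooled blocks (stubbed).
    return {"knowledge_blocks": ["...top-k knowledge blocks..."]}

def generate_answer(state: QAState) -> QAState:
    # Build the chain-of-thought prompt from the blocks and call the LLM (stubbed).
    return {"answer": "...final answer..."}

graph = StateGraph(QAState)
graph.add_node("intent", intent_recognition)
graph.add_node("retrieve", retrieve_modules)
graph.add_node("answer", generate_answer)
graph.set_entry_point("intent")
# Conditional edge: branch to retrieval only when the LLM asks for more information.
graph.add_conditional_edges(
    "intent",
    lambda s: "retrieve" if s.get("needs_retrieval") else "answer",
    {"retrieve": "retrieve", "answer": "answer"},
)
graph.add_edge("retrieve", "answer")
graph.add_edge("answer", END)
app = graph.compile()

result = app.invoke({"question": "How does PFOS exposure affect human health?"})
```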
In summary, the integrated Q&A system proposed in this study significantly enhances flexibility and scalability by subdividing the knowledge base into multiple submodules. This modular design allows for independent updates and maintenance of each submodule without large-scale modifications to the overall system architecture. As new research areas or topics emerge, new submodules can be easily integrated without affecting the stability of the existing system. Moreover, the system can selectively activate relevant modules based on the specific requirements of the question, optimizing resource allocation and improving operational efficiency.
6. Conclusions
In this study, we first established an automatic method for generating high-quality question–answer pairs and produced 300 relevant pairs in the field of health risk assessment. Second, we developed an Advanced RAG Q&A system that integrates a dual-layer retrieval and information extraction mechanism and incorporates novel retrieval techniques such as RAG-Fusion and Step-back. Testing results based on the question–answer pairs indicate that the developed system outperforms both naive retrieval systems and large language models without retrieval enhancement in terms of answer accuracy and relevance. This result confirms the limitations of LLMs when handling specialized question-answering tasks and demonstrates that retrieval enhancement can alleviate this issue to some extent. Lastly, this study employed the LangGraph framework to abstract the entire data-processing flow into a graph data structure and successfully integrated the advanced retrieval framework into a comprehensive Q&A system, providing users with an efficient information query and processing solution.
The theoretical significance of this research lies in the combination of RAG technology with large language models, optimizing knowledge retrieval methods in human health risk assessment and advancing the application of natural language processing technologies in specialized fields. Practically, the multi-knowledge base question-answering system we developed improves the efficiency of literature retrieval and information extraction, helping researchers obtain relevant knowledge more quickly and accurately. This system provides practical tools for health risk assessment and interdisciplinary collaboration, promoting decision support and knowledge sharing.
This study assumes that most specialized problems can be addressed by utilizing the facts, concepts, and processes from paper abstracts or content. We particularly emphasize that the paper abstract can provide concise key information and is often an effective starting point for problem solving, especially when dealing with high-level issues in specialized fields. However, we recognize that relying solely on the abstract may sometimes be insufficient, particularly when the abstract is overly brief or vague. Therefore, while this study relies on the paper’s abstract, it also incorporates the content of the paper to ensure the accuracy and comprehensiveness of the information. Specifically, we employ a dual-retrieval strategy that combines the processing of both the abstract and the content, reducing the risk of bias or misguidance that may arise from relying solely on the abstract. We acknowledge that the information in the abstract may indeed have certain biases or limitations; thus, during the final decision-making process, we carefully evaluate the accuracy of the abstract and validate and supplement the information through further retrieval processes. This approach helps us ensure efficiency while minimizing potential misunderstandings.
The process of building the knowledge base in this study includes PDF conversion recognition and document chunking. We noted that if the PDF documents cannot be accurately recognized, or if the retrieval algorithm fails to obtain sufficient valid information from the database, it may affect the final results. To improve the quality of the contextual content retrieved, this study adopted a dual-layer retrieval strategy, which alleviates the limitations of vector-matching retrieval algorithms to some extent, significantly enhancing the question-answering effectiveness. However, despite the precise recognition of PDF documents improving the validity of contextual information, the technology underlying this method remains immature, and our vector database still relies on the traditional “document chunking—vectorization” building process. Even with information extraction based on LLMs, noise may still affect the accuracy of the final LLM responses. Retrieval enhancement based on literature data inevitably encounters issues such as format recognition errors and retrieval accuracy. Future research directions should focus on integrating PDF recognition with retrieval enhancement to better address these issues.