RAG Evaluation

This repository builds on the RAGMeUp framework. This README is specific to the added RAG Evaluation framework. The framework was run from Google Colab, and it is advised to run the scripts from Colab_RAG_Eval.ipynb in the Colab environment. The notebook uses a .env template for evaluation. This template is loaded as the .env file, and the variables described below can be changed by writing to this environment. Lastly, ensure you have a HuggingFace token to insert.
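
As a minimal sketch (not part of the repository), loading the template and inserting a token from a Colab cell could look like this; the token variable name `HF_TOKEN` is an assumption, so use whatever name the template actually defines:

```python
# Illustrative only: load the provided template (saved as .env) and write a token
# into the environment. "HF_TOKEN" is an assumed variable name.
import os
from dotenv import load_dotenv

load_dotenv(".env")                 # the evaluation template, saved as the .env file
os.environ["HF_TOKEN"] = "hf_..."   # insert your own HuggingFace token here
```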

Run the eval_create_testset.py file to create a testset. This testset is a dataset of QA-pairs. It is saved as a .csv file in the folder testsets. If this directory does not exist in the server directory, it is created. Within this folder, a new folder is created to save the testset.csv file and a rag_chunk.pickle file, which stores the chunks that are parsed from the documents.
The following variables can be adjusted before creating the testset (a usage sketch follows the list):

  • chunk_size sets the size of the chunks the script uses to generate questions from.
  • rerank_k defines how many chunks the LLM uses to generate a question from (it is advised to keep rerank set to True for RAG evaluation).
  • eval_qa_pairs sets the number of question-answer pairs that should be generated.
  • eval_sample_size sets the number of chunks to sample from when generating QA-pairs.
  • eval_question_query sets the prompt for generating questions.
  • eval_catch_irrelevant_chunks sets whether a prompt is added to the question query that allows the LLM to skip creating a question from irrelevant chunks (True/False).
  • eval_catch_irrelevant_chunks_prompt sets the prompt to use if the previous variable is True.
  • eval_check_sample_relevance sets whether the LLM should first judge whether a chunk is relevant enough to generate a question from (True/False).
  • eval_check_sample_relevance_instruction sets the instruction prompt if check_sample_relevance is True.
  • eval_check_sample_relevance_query sets the query prompt if check_sample_relevance is True.
  • eval_retrieve_samples sets whether the same samples as a previously generated testset should be used (True/False).
  • eval_retrieve_samples_folder sets the folder from which the testset should be retrieved if the previous variable is True.
  • eval_use_example_questions sets whether a prompt is added to the question query that provides example questions to the LLM.
  • eval_example_questions sets the example questions if the previous variable is True; provide them as a string representation of a list.
  • eval_example_questions_prompt sets the prompt that instructs the LLM what to do with the example questions if use_example_questions is True.
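
A minimal sketch (not part of the repository) of configuring and running testset creation from a Colab cell is shown below. The variable names come from the list above; the values and the `!python` invocation from the server directory are illustrative assumptions:

```python
# Illustrative only: write evaluation variables to the environment before running
# the testset-creation script. Values shown are examples, not recommended settings.
import os

os.environ["chunk_size"] = "512"
os.environ["eval_qa_pairs"] = "30"
os.environ["eval_sample_size"] = "100"
os.environ["eval_catch_irrelevant_chunks"] = "True"
os.environ["eval_example_questions"] = '["What is the main topic of the document?"]'

# From a notebook cell, assuming the scripts are run from the server directory:
# !python eval_create_testset.py
```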

Run the eval_evaluate_RAG.py file to evaluate a RAG instance with a specified testset. The RAG's retrieved chunks and generated answers are added to the testset, and Recall and Recall-top-k are computed and printed. The resulting evalset is saved in the same way as the testset, and also as an Excel file for inspection. The following variables can be adjusted before evaluating the RAG (a usage sketch follows the list):

  • eval_testset_directory sets the directory in which the testset to use can be found.
  • eval_RAG_instance_name sets the name of the RAG instance, so that instances can be compared by their column names.
  • eval_ragas sets whether the Ragas library should be used to compute evaluation metrics. Note that this is expected to give a timeout or out-of-memory error when running in Colab.
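
A minimal sketch (not part of the repository) of configuring and running an evaluation from a Colab cell is shown below. The variable names come from the list above; the folder path format and the `!python` invocation are illustrative assumptions:

```python
# Illustrative only: point the evaluation at a testset folder and name the RAG
# instance before running the evaluation script.
import os

os.environ["eval_testset_directory"] = "testsets/30QA"   # assumed path format; the 30QA example testset ships with the repo
os.environ["eval_RAG_instance_name"] = "baseline_rag"     # example instance name
os.environ["eval_ragas"] = "False"                        # Ragas tends to time out or run out of memory in Colab

# From a notebook cell, assuming the scripts are run from the server directory:
# !python eval_evaluate_RAG.py
```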

The repository includes the data that was used in the analysis in the server/data folder, but this can be replaced with any documents. An example testset based on this data is included in server/testsets/30QA.
