RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation

pat-jj/RAS

This is the official implementation of the RAS paper.

Environment Setup

Run the commands below in the exact order shown to set up the environment.

# Create new env
conda create -n ras python=3.10 -y

# Activate it
conda activate ras

# Install PyTorch first, separately
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

pip install torch-geometric

pip install torch-scatter torch-sparse torch-cluster torch-spline-conv -f https://data.pyg.org/whl/torch-2.5.1+cu118.html # adjust "torch-2.5.1+cu118" to match your installed torch and CUDA versions

pip install transformers wandb tqdm peft accelerate bitsandbytes sentencepiece
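
The wheel index URL in the torch-scatter step encodes both the torch version and the CUDA build, so it has to match what the PyTorch step actually installed. A minimal sketch of how that URL could be derived from a version string such as `torch.__version__`; the helper name `pyg_wheel_url` is hypothetical and not part of this repository, and the URL pattern is simply the one used in the command above:

```python
import re

# Pattern taken from the install command above; {torch} and {cuda} are placeholders.
PYG_WHEEL_INDEX = "https://data.pyg.org/whl/torch-{torch}+{cuda}.html"


def pyg_wheel_url(torch_version: str) -> str:
    """Build the wheel index URL for a version string like '2.5.1+cu118'.

    Hypothetical helper: it only reformats the string and does not check
    that the resulting index actually exists.
    """
    match = re.fullmatch(r"(\d+\.\d+\.\d+)\+(\w+)", torch_version.strip())
    if match is None:
        # No '+cuXXX' / '+cpu' suffix: the build cannot be inferred from the string.
        raise ValueError(f"cannot infer build tag from {torch_version!r}")
    return PYG_WHEEL_INDEX.format(torch=match.group(1), cuda=match.group(2))
```

After the PyTorch step, passing `torch.__version__` to this function should give the value to use after `-f`. If the version string carries no build suffix, the function raises instead of guessing.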

Train Theme Classifier and Distribution Shifter

First, download the DBPedia-298 dataset from here.

[Train]

cd classifier_shifter
sh doc_train.sh
sh shifter_train.sh # please process HotpotQA-SUBQ data (see below) before this

[Test]

cd classifier_shifter
python theme_predictor.py

Train Text-to-Triples Model

First, download the WikiOFGraph dataset from here.

cd text_to_triples
sh train.sh

Training Data (HotpotQA-SUBQ) Processing

You can download the processed data from here.

Alternatively, you can process the data yourself as follows:

First, download the HotpotQA training set from here.

Then, run the following commands to process the training data.

cd llm_training_data_process
# Process the hotpotqa data
python process_hotpot.py

# Generate subqueries for hotpotqa
python generate_subqueries.py

# Identify questions that need neither subqueries nor retrieval
python training_data_gen_wo_ret_wo_subq.py

cd ../text_to_triples

# Generate triples for hotpotqa docs
python generate.py

# Process the data with graphs
cd ../llm_training_data_process
python a_planner_data_process.py
python a_1_hotpotqa_only.py
python b_answerer_data_process.py
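
The processing steps above depend on each other's outputs, so order matters. A minimal sketch of a driver that runs them in the listed order and stops at the first failure; `run_pipeline` and `HOTPOT_STEPS` are hypothetical names, and the sketch assumes each script runs without extra arguments from its own directory, as in the commands above:

```python
import subprocess
import sys
from pathlib import Path

# (directory, script) pairs in the order given in this README.
HOTPOT_STEPS = [
    ("llm_training_data_process", "process_hotpot.py"),
    ("llm_training_data_process", "generate_subqueries.py"),
    ("llm_training_data_process", "training_data_gen_wo_ret_wo_subq.py"),
    ("text_to_triples", "generate.py"),
    ("llm_training_data_process", "a_planner_data_process.py"),
    ("llm_training_data_process", "a_1_hotpotqa_only.py"),
    ("llm_training_data_process", "b_answerer_data_process.py"),
]


def run_pipeline(repo_root, steps=HOTPOT_STEPS, dry_run=True):
    """Return the planned (cwd, command) list; execute it unless dry_run is set."""
    root = Path(repo_root)
    planned = []
    for directory, script in steps:
        cwd = root / directory
        command = [sys.executable, script]
        planned.append((str(cwd), command))
        if not dry_run:
            # check=True raises CalledProcessError, so later steps never
            # run on top of a failed earlier step.
            subprocess.run(command, cwd=cwd, check=True)
    return planned
```

Calling `run_pipeline(".")` from the repository root only returns the plan; pass `dry_run=False` to execute it.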

Train GraphLLM by Multi-task Learning (w/ processed training data)

[Train]

cd framework
sh train.sh # or train_8b.sh for 8B model

[Test] (w/ hotpotqa-subq validation data)

sh test_planner.sh
sh test_answerer.sh

Knowledge Indexing (Prepare both Theme and Dense Faiss Indexes)

# Download corpora
cd knowledge_indexing
sh download_corpora.sh

# Theme Indexing
cd theme
sh class_labeling.sh
sh convert.sh

# Dense Indexing
cd ../dense
sh dense_index.sh
sh combine.sh

Run Baselines

cd baselines
sh run.sh # see the arguments in run.sh to change the dataset, model, etc.

Run RAS

cd framework
sh run_ras.sh

Evaluation

Run the eval.ipynb notebook in the framework/ folder.

Uncomment the metrics you want to evaluate.

Note:

To run the closed-source Sonnet-3.5 model in either the baseline setting or RAS, fill in the key information in claude_api_example.py, rename it to claude_api.py, and place a copy under both baselines/ and framework/.
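
Placing the file in both folders can be scripted. A minimal sketch, assuming you have already filled in the key information; `install_claude_api` is a hypothetical helper, not part of the repository, and it copies rather than renames, so the example file stays where it is:

```python
import shutil
from pathlib import Path


def install_claude_api(filled_example, repo_root, targets=("baselines", "framework")):
    """Copy the filled-in example to <target>/claude_api.py for each target folder."""
    source = Path(filled_example)
    if not source.is_file():
        raise FileNotFoundError(source)
    written = []
    for target in targets:
        directory = Path(repo_root) / target
        if not directory.is_dir():
            raise NotADirectoryError(directory)
        destination = directory / "claude_api.py"
        shutil.copyfile(source, destination)  # overwrites an existing claude_api.py
        written.append(destination)
    return written
```

Pass the path to your filled-in claude_api_example.py as the first argument and the repository root as the second.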
