RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation

This is the official implementation of the RAS paper.

Environment Setup

Run the commands below, in the exact order shown, to set up the environment.

# Create new env
conda create -n ras python=3.10 -y

# Activate it
conda activate ras

# Install PyTorch first, separately
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

pip install torch-geometric

pip install torch-scatter torch-sparse torch-cluster torch-spline-conv -f https://data.pyg.org/whl/torch-2.5.1+cu118.html # change the torch and CUDA versions in this URL to match your install

pip install transformers wandb tqdm peft accelerate bitsandbytes sentencepiece

Train Theme Classifier and Distribution Shifter

First, download the DBPedia-298 dataset from here.

[Train]

cd classifier_shifter
sh doc_train.sh
sh shifter_train.sh # process the HotpotQA-SUBQ data (see below) before running this

[Test]

cd classifier_shifter
python theme_predictor.py

Train Text-to-Triples Model

First, download the WikiOFGraph dataset from here.

cd text_to_triples
sh train.sh

Training Data (HotpotQA-SUBQ) Processing

You can download the processed data from here.

Alternatively, you can process the data yourself as follows:

First, download the HotpotQA training set from here.

Then, run the following commands to process the training data.

cd llm_training_data_process
# Process the hotpotqa data
python process_hotpot.py

# Generate subqueries for hotpotqa
python generate_subqueries.py

# Identify questions that need neither subqueries nor retrieval
python training_data_gen_wo_ret_wo_subq.py

cd ../text_to_triples

# Generate triples for hotpotqa docs
python generate.py

# Process the data with graphs
cd ../llm_training_data_process
python a_planner_data_process.py
python a_1_hotpotqa_only.py
python b_answerer_data_process.py
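The processing steps above must run in order and from two different folders, so a small driver can make the sequence explicit. This is a sketch only: `STEPS` and `run_pipeline` are hypothetical names, not part of this repository, and the sketch assumes each script runs without extra arguments from its own folder, as in the commands above.

```python
import subprocess
import sys
from pathlib import Path

# (folder, script) pairs in the order listed in this README.
STEPS = [
    ("llm_training_data_process", "process_hotpot.py"),
    ("llm_training_data_process", "generate_subqueries.py"),
    ("llm_training_data_process", "training_data_gen_wo_ret_wo_subq.py"),
    ("text_to_triples", "generate.py"),
    ("llm_training_data_process", "a_planner_data_process.py"),
    ("llm_training_data_process", "a_1_hotpotqa_only.py"),
    ("llm_training_data_process", "b_answerer_data_process.py"),
]


def run_pipeline(repo_root=".", dry_run=True):
    """Run each step from its own folder and return the steps visited.

    With dry_run=True nothing is executed; the planned order is returned.
    With dry_run=False a failing script raises CalledProcessError, so
    later steps never run on incomplete data.
    """
    visited = []
    for folder, script in STEPS:
        visited.append(f"{folder}/{script}")
        if not dry_run:
            subprocess.run(
                [sys.executable, script],
                cwd=Path(repo_root) / folder,
                check=True,
            )
    return visited
```

Calling `run_pipeline()` prints nothing and executes nothing; it returns the planned order so you can check it before running with `dry_run=False`.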

Train GraphLLM by Multi-task Learning (w/ processed training data)

[Train]

cd framework
sh train.sh # or train_8b.sh for the 8B model

[Test] (w/ hotpotqa-subq validation data)

sh test_planner.sh
sh test_answerer.sh

Knowledge Indexing (Prepare both Theme and Dense Faiss Indexes)

# Download corpora
cd knowledge_indexing
sh download_corpora.sh

# Theme Indexing
cd theme
sh class_labeling.sh
sh convert.sh

# Dense Indexing
cd ../dense
sh dense_index.sh
sh combine.sh

Run Baselines

cd baselines
sh run.sh # edit the arguments in run.sh to change the dataset, model, etc.

Run RAS

cd framework
sh run_ras.sh

Evaluation

Run the eval.ipynb notebook in the framework/ folder.

Uncomment the metrics you want to evaluate.

Note:

To run the closed-source Sonnet-3.5 model in either the baselines or RAS, fill in the key information in claude_api_example.py, rename the file to claude_api.py, and place a copy under both baselines/ and framework/.
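The copy step in this note can be scripted. The sketch below assumes you have already filled in the key information in claude_api_example.py at the repository root; `install_claude_api` is a hypothetical helper, not part of this repository. It copies the file rather than renaming it, so the template stays in place.

```python
import shutil
from pathlib import Path


def install_claude_api(repo_root="."):
    """Copy the filled-in example file to claude_api.py in both folders."""
    root = Path(repo_root)
    source = root / "claude_api_example.py"
    if not source.is_file():
        raise FileNotFoundError(source)
    written = []
    for folder in ("baselines", "framework"):
        destination = root / folder / "claude_api.py"
        # copyfile overwrites an existing claude_api.py with the new copy.
        shutil.copyfile(source, destination)
        written.append(destination)
    return written
```

Run it from the repository root with `install_claude_api()`; it returns the two paths it wrote.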
