We provide the codebase for "Understanding Impact of Human Feedback via Influence Functions". Our work uses influence functions to measure the impact of human feedback on the performance of reward models. This repository contains the source code to replicate our work, specifically the length and sycophancy bias detection experiments.
conda create -n if_rlhf python=3.10 absl-py pyparsing pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
conda activate if_rlhf
conda env config vars set IF_RLHF_HOME=/path/to/current/directory
conda env config vars set WANDB_PROJECT=IF_RLHF # for wandb logging
conda deactivate && conda activate if_rlhf
cd $IF_RLHF_HOME # check home directory
python -c "import torch; print(torch.cuda.is_available())" # should print True
python -m pip install -e .
pip install -r requirements.txt
MAX_JOBS=4 pip install flash-attn --no-build-isolation
conda install -c conda-forge mpi4py mpich
huggingface-cli login
wandb login
First, prepare the length- and sycophancy-biased datasets (a 15k-sample subset of the Anthropic/HH-rlhf dataset).
cd $IF_RLHF_HOME
mkdir dataset
python src/reward_modeling/make_dataset.py
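The dataset construction is handled by `src/reward_modeling/make_dataset.py`. As a rough illustration of what a length-biased preference pair looks like, the sketch below relabels the longer response as the preferred one; the field names, bias rule, and save path here are assumptions for illustration, not the exact implementation.

```python
# Illustrative sketch only: the real construction lives in
# src/reward_modeling/make_dataset.py (which also builds the sycophancy-biased
# split). Field names, the relabeling rule, and paths are assumptions.
from datasets import load_dataset

raw = load_dataset("Anthropic/hh-rlhf", split="train").shuffle(seed=42).select(range(15000))

def inject_length_bias(example):
    # A length-biased pair marks the longer response as "chosen",
    # regardless of the original human preference.
    chosen, rejected = example["chosen"], example["rejected"]
    if len(rejected) > len(chosen):
        chosen, rejected = rejected, chosen
    return {"chosen": chosen, "rejected": rejected}

length_biased = raw.map(inject_length_bias)
length_biased.save_to_disk("dataset/length_dataset/train")
```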
Example scripts for training reward models (based on Llama-3-8B) on the length- and sycophancy-biased datasets:
CUDA_VISIBLE_DEVICES=0 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero2.yaml --num_processes=1 --main_process_port=1231 src/reward_modeling/reward_modeling.py recipes/reward_modeling/Llama-3-8B_length.yaml
CUDA_VISIBLE_DEVICES=0 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero2.yaml --num_processes=1 --main_process_port=1231 src/reward_modeling/reward_modeling.py recipes/reward_modeling/Llama-3-8B_sycophancy.yaml
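The training hyperparameters live in the YAML recipes under `recipes/reward_modeling/`. For reference, reward models of this kind are typically trained with the Bradley-Terry pairwise objective sketched below; this is shown only to make the training target concrete, and the actual loop is driven by `src/reward_modeling/reward_modeling.py`.

```python
# Reference sketch of the standard Bradley-Terry pairwise loss commonly used
# for reward modeling; see src/reward_modeling/reward_modeling.py and the
# YAML recipes for the actual training setup.
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                         rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Maximize the log-probability that the chosen response outscores the
    # rejected one: -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```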
- Length bias
CUDA_VISIBLE_DEVICES=0 python src/influence/cache_gradients.py \
--model_path "logs/Llama-3-8B_length" \
--data_path "dataset/length_dataset/train" \
--save_name "rapid_grad_train.pt" \
--seed 42
CUDA_VISIBLE_DEVICES=0 python src/influence/cache_gradients.py \
--model_path "logs/Llama-3-8B_length" \
--data_path "dataset/length_dataset/test" \
--save_name "rapid_grad_val.pt" \
--seed 42
- Sycophancy bias
CUDA_VISIBLE_DEVICES=0 python src/influence/cache_gradients.py \
--model_path "logs/Llama-3-8B_sycophancy" \
--data_path "dataset/sycophancy_dataset/train" \
--save_name "rapid_grad_train.pt" \
--seed 42
CUDA_VISIBLE_DEVICES=0 python src/influence/cache_gradients.py \
--model_path "logs/Llama-3-8B_sycophancy" \
--data_path "dataset/sycophancy_dataset/test" \
--save_name "rapid_grad_val.pt" \
--seed 42
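Conceptually, `cache_gradients.py` computes a per-example gradient of the reward-model loss and stacks the results into a single tensor for the influence computation. The sketch below shows the idea only; the real script handles batching, precision, and any gradient compression, and `reward_model`/`dataset` here are placeholders.

```python
# Conceptual sketch of per-example gradient caching; reward_model and dataset
# are placeholders, and src/influence/cache_gradients.py is the real
# implementation.
import torch
import torch.nn.functional as F

def cache_gradients(reward_model, dataset, save_name):
    grads = []
    for example in dataset:
        reward_model.zero_grad()
        # Pairwise reward-model loss for this single preference pair.
        r_chosen = reward_model(example["chosen"])
        r_rejected = reward_model(example["rejected"])
        loss = -F.logsigmoid(r_chosen - r_rejected)
        loss.backward()
        # Flatten and concatenate this example's parameter gradients.
        g = torch.cat([p.grad.reshape(-1)
                       for p in reward_model.parameters() if p.grad is not None])
        grads.append(g.cpu())
    torch.save(torch.stack(grads), save_name)
```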
Follow the scripts in measure_length_bias.ipynb and measure_sycophancy_bias.ipynb to compute influence values and plot receiver operating characteristic (ROC) curves.
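The notebooks implement the actual influence estimator and plotting. As a rough sketch of the idea, a first-order influence score can be formed from the cached gradients as the negative inner product between each training gradient and the average validation gradient, and the bias labels from the dataset construction step can then be used to draw an ROC curve. The file paths and the bias-label file name below are assumptions; follow the notebooks for the exact estimator and sign convention.

```python
# Rough sketch only: the notebooks define the actual influence estimator
# (including any Hessian approximation) and the exact file locations.
# Paths and the bias-label file name are assumptions for illustration.
import torch
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

train_grads = torch.load("rapid_grad_train.pt")  # [N_train, D]
val_grads = torch.load("rapid_grad_val.pt")      # [N_val, D]

# First-order influence of each training pair on the validation loss:
# positive scores indicate samples whose upweighting would increase it.
influence = -(train_grads @ val_grads.mean(dim=0))  # [N_train]

# Hypothetical per-example bias labels (1 = biased pair) from dataset creation.
is_biased = torch.load("dataset/length_dataset/train_bias_labels.pt")

fpr, tpr, _ = roc_curve(is_biased.numpy(), influence.numpy())
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.3f}")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```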