
Understanding Impact of Human Feedback via Influence Functions

[Introductory figure]


We provide the codebase for "Understanding Impact of Human Feedback via Influence Functions". Our work uses influence functions to measure how individual pieces of human feedback affect reward model performance. This repository contains the source code to replicate our work, specifically the length and sycophancy bias detection experiments.
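For reference, the quantity underlying the computation is the standard first-order influence-function approximation (Koh & Liang, 2017), shown below in its usual form; the paper and the notebooks in this repo use the DataInf estimator for the inverse-Hessian term:

$$\mathcal{I}(z, z_{\mathrm{val}}) \approx -\nabla_\theta \ell(z_{\mathrm{val}}, \hat{\theta})^{\top} H_{\hat{\theta}}^{-1} \, \nabla_\theta \ell(z, \hat{\theta}),$$

where $z$ is a training preference pair, $z_{\mathrm{val}}$ a validation example, $\hat{\theta}$ the trained reward model parameters, and $H_{\hat{\theta}}$ the Hessian of the training loss. A positive value means upweighting $z$ is estimated to increase the validation loss.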

Install


create conda environment

conda create -n if_rlhf python=3.10 absl-py pyparsing pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
conda activate if_rlhf
conda env config vars set IF_RLHF_HOME=/path/to/current/directory
conda env config vars set WANDB_PROJECT=IF_RLHF # for wandb logging
conda deactivate && conda activate if_rlhf
cd $IF_RLHF_HOME # check home directory

check gpu (run inside a Python interpreter)

import torch
torch.cuda.is_available() # should show True

install the remaining package dependencies

python -m pip install -e .
pip install -r requirements.txt

install flash attention

MAX_JOBS=4 pip install flash-attn --no-build-isolation

install mpi4py for DeepSpeed

conda install -c conda-forge mpi4py mpich

Log in to Hugging Face and Weights & Biases

huggingface-cli login
wandb login

Datasets

First, prepare the length- and sycophancy-biased datasets (a 15k subset of the Anthropic/HH-rlhf dataset).

cd $IF_RLHF_HOME
mkdir dataset
python src/reward_modeling/make_dataset.py
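To sanity-check the output, you can load one of the generated splits. This is a minimal sketch, assuming make_dataset.py writes Hugging Face datasets to the directories referenced by the gradient-caching commands below; the exact column names (e.g. chosen/rejected) may differ in this repo.

from datasets import load_from_disk

# Paths follow the influence-computation commands below; adjust if your layout differs.
train_ds = load_from_disk("dataset/length_dataset/train")
print(train_ds)      # number of rows and column names
print(train_ds[0])   # inspect one preference pair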

Reward Modeling

Reward Modeling using length-biased dataset

Example script for training a reward model (based on Llama-3-8B) on the length-biased dataset:

CUDA_VISIBLE_DEVICES=0 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero2.yaml --num_processes=1 --main_process_port=1231 src/reward_modeling/reward_modeling.py recipes/reward_modeling/Llama-3-8B_length.yaml

Reward Modeling using sycophancy-biased dataset

CUDA_VISIBLE_DEVICES=0 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero2.yaml --num_processes=1 --main_process_port=1231 src/reward_modeling/reward_modeling.py recipes/reward_modeling/Llama-3-8B_sycophancy.yaml
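After training, you can spot-check a trained reward model by scoring a response. The snippet below is a minimal sketch, not part of the repo: it assumes the checkpoint in logs/Llama-3-8B_length is a standard transformers sequence-classification reward model with a single scalar output (the usual TRL reward-modeling convention); adapt the path and loading code to the actual checkpoint format.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_path = "logs/Llama-3-8B_length"  # output directory used by the recipe above
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(
    model_path, num_labels=1, torch_dtype=torch.bfloat16
).cuda().eval()

def reward(text: str) -> float:
    # Returns the scalar reward the model assigns to a full prompt+response string.
    inputs = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)
    with torch.no_grad():
        return model(**inputs).logits[0, 0].item()

print(reward("Human: How do I boil an egg?\n\nAssistant: Simmer it for 7-9 minutes."))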

Influence Computation

Cache Gradients (store per-example gradients once for efficient reuse)

  • Length bias
CUDA_VISIBLE_DEVICES=0 python src/influence/cache_gradients.py \
--model_path "logs/Llama-3-8B_length" \
--data_path "dataset/length_dataset/train" \
--save_name "rapid_grad_train.pt" \
--seed 42

CUDA_VISIBLE_DEVICES=0 python src/influence/cache_gradients.py \
--model_path "logs/Llama-3-8B_length" \
--data_path "dataset/length_dataset/test" \
--save_name "rapid_grad_val.pt" \
--seed 42
  • Sycophancy bias
CUDA_VISIBLE_DEVICES=0 python src/influence/cache_gradients.py \
--model_path "logs/Llama-3-8B_sycophancy" \
--data_path "dataset/sycophancy_dataset/train" \
--save_name "rapid_grad_train.pt" \
--seed 42

CUDA_VISIBLE_DEVICES=0 python src/influence/cache_gradients.py \
--model_path "logs/Llama-3-8B_sycophancy" \
--data_path "dataset/sycophancy_dataset/test" \
--save_name "rapid_grad_val.pt" \
--seed 42

Compute Influence Functions using DataInf

Follow the notebooks measure_length_bias.ipynb and measure_sycophancy_bias.ipynb to compute influence values and plot receiver operating characteristic (ROC) curves.
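If you want to see the overall shape of the computation outside the notebooks, the sketch below loads the cached gradients, scores each training example, and measures how well the scores separate biased from unbiased examples. It is a simplified illustration, not the repo's estimator: it uses an identity-Hessian (gradient dot-product) approximation instead of DataInf, assumes the cached .pt files hold one flattened gradient vector per example as a 2-D tensor, and uses placeholder bias labels that you would replace with the real per-example labels.

import torch
from sklearn.metrics import roc_auc_score, roc_curve

# Assumed layout: [num_examples, num_params] flattened per-example gradients.
# Adjust paths to wherever cache_gradients.py saved the files.
train_grads = torch.load("rapid_grad_train.pt").float()
val_grads = torch.load("rapid_grad_val.pt").float()

# Simplified influence score: negative alignment between each training gradient and
# the mean validation gradient (identity-Hessian approximation, not the DataInf estimator).
val_mean = val_grads.mean(dim=0)
influence = -(train_grads @ val_mean)   # shape: [num_train]; positive = raises validation loss

# Placeholder labels (1 = biased example); replace with the labels produced by make_dataset.py.
bias_labels = (torch.rand(len(train_grads)) < 0.5).long()

scores = influence.numpy()
print("AUC:", roc_auc_score(bias_labels.numpy(), scores))
fpr, tpr, _ = roc_curve(bias_labels.numpy(), scores)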
