We provide the codebase for "Understanding Impact of Human Feedback via Influence Functions". Our work uses influence functions to measure the impact of human feedback on the performance of reward models. This repository contains the source code to replicate our work, specifically the length and sycophancy bias detection experiments.
conda create -n if_rlhf python=3.10 absl-py pyparsing pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
conda activate if_rlhf
conda env config vars set IF_RLHF_HOME=/path/to/current/directory
conda env config vars set WANDB_PROJECT=IF_RLHF # for wandb logging
conda deactivate && conda activate if_rlhf
cd $IF_RLHF_HOME # check home directory
python -c "import torch; print(torch.cuda.is_available())" # should print True
python -m pip install -e .
pip install -r requirements.txt
MAX_JOBS=4 pip install flash-attn --no-build-isolation
conda install -c conda-forge mpi4py mpich
huggingface-cli login
wandb login
First, prepare the length- and sycophancy-biased datasets (a 15k-sample subset of the Anthropic/HH-rlhf dataset).
cd $IF_RLHF_HOME
mkdir dataset
python src/reward_modeling/make_dataset.py
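The dataset construction is handled by `src/reward_modeling/make_dataset.py`. As a rough illustration of what a length-biased preference pair looks like, the sketch below relabels the longer response as the preferred one; the field names, bias rule, and save path here are assumptions for illustration, not the exact implementation.

```python
# Illustrative sketch only: the real construction lives in
# src/reward_modeling/make_dataset.py (which also builds the sycophancy-biased
# split). Field names, the relabeling rule, and paths are assumptions.
from datasets import load_dataset

raw = load_dataset("Anthropic/hh-rlhf", split="train").shuffle(seed=42).select(range(15000))

def inject_length_bias(example):
    # A length-biased pair marks the longer response as "chosen",
    # regardless of the original human preference.
    chosen, rejected = example["chosen"], example["rejected"]
    if len(rejected) > len(chosen):
        chosen, rejected = rejected, chosen
    return {"chosen": chosen, "rejected": rejected}

length_biased = raw.map(inject_length_bias)
length_biased.save_to_disk("dataset/length_dataset/train")
```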
Example scripts for training reward models (based on Llama-3-8B) on the length- and sycophancy-biased datasets:
CUDA_VISIBLE_DEVICES=0 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero2.yaml --num_processes=1 --main_process_port=1231 src/reward_modeling/reward_modeling.py recipes/reward_modeling/Llama-3-8B_length.yaml
CUDA_VISIBLE_DEVICES=0 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero2.yaml --num_processes=1 --main_process_port=1231 src/reward_modeling/reward_modeling.py recipes/reward_modeling/Llama-3-8B_sycophancy.yaml
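The training hyperparameters live in the YAML recipes under `recipes/reward_modeling/`. For reference, reward models of this kind are typically trained with the Bradley-Terry pairwise objective sketched below; this is shown only to make the training target concrete, and the actual loop is driven by `src/reward_modeling/reward_modeling.py`.

```python
# Reference sketch of the standard Bradley-Terry pairwise loss commonly used
# for reward modeling; see src/reward_modeling/reward_modeling.py and the
# YAML recipes for the actual training setup.
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                         rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Maximize the log-probability that the chosen response outscores the
    # rejected one: -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```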
- Length bias
CUDA_VISIBLE_DEVICES=0 python src/influence/cache_gradients.py \
--model_path "logs/Llama-3-8B_length" \
--data_path "dataset/length_dataset/train" \
--save_name "rapid_grad_train.pt" \
--seed 42
CUDA_VISIBLE_DEVICES=0 python src/influence/cache_gradients.py \
--model_path "logs/Llama-3-8B_length" \
--data_path "dataset/length_dataset/test" \
--save_name "rapid_grad_val.pt" \
--seed 42
- Sycophancy bias
CUDA_VISIBLE_DEVICES=0 python src/influence/cache_gradients.py \
--model_path "logs/Llama-3-8B_sycophancy" \
--data_path "dataset/sycophancy_dataset/train" \
--save_name "rapid_grad_train.pt" \
--seed 42
CUDA_VISIBLE_DEVICES=0 python src/influence/cache_gradients.py \
--model_path "logs/Llama-3-8B_sycophancy" \
--data_path "dataset/sycophancy_dataset/test" \
--save_name "rapid_grad_val.pt" \
--seed 42
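Conceptually, `cache_gradients.py` computes a per-example gradient of the reward-model loss and stacks the results into a single tensor for the influence computation. The sketch below shows the idea only; the real script handles batching, precision, and any gradient compression, and `reward_model`/`dataset` here are placeholders.

```python
# Conceptual sketch of per-example gradient caching; reward_model and dataset
# are placeholders, and src/influence/cache_gradients.py is the real
# implementation.
import torch
import torch.nn.functional as F

def cache_gradients(reward_model, dataset, save_name):
    grads = []
    for example in dataset:
        reward_model.zero_grad()
        # Pairwise reward-model loss for this single preference pair.
        r_chosen = reward_model(example["chosen"])
        r_rejected = reward_model(example["rejected"])
        loss = -F.logsigmoid(r_chosen - r_rejected)
        loss.backward()
        # Flatten and concatenate this example's parameter gradients.
        g = torch.cat([p.grad.reshape(-1)
                       for p in reward_model.parameters() if p.grad is not None])
        grads.append(g.cpu())
    torch.save(torch.stack(grads), save_name)
```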
Follow the scripts in measure_length_bias.ipynb and measure_sycophancy_bias.ipynb to compute influence values and plot receiver operating characteristic (ROC) curves.
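The notebooks implement the actual influence estimator and plotting. As a rough sketch of the idea, a first-order influence score can be formed from the cached gradients as the negative inner product between each training gradient and the average validation gradient, and the bias labels from the dataset construction step can then be used to draw an ROC curve. The file paths and the bias-label file name below are assumptions; follow the notebooks for the exact estimator and sign convention.

```python
# Rough sketch only: the notebooks define the actual influence estimator
# (including any Hessian approximation) and the exact file locations.
# Paths and the bias-label file name are assumptions for illustration.
import torch
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

train_grads = torch.load("rapid_grad_train.pt")  # [N_train, D]
val_grads = torch.load("rapid_grad_val.pt")      # [N_val, D]

# First-order influence of each training pair on the validation loss:
# positive scores indicate samples whose upweighting would increase it.
influence = -(train_grads @ val_grads.mean(dim=0))  # [N_train]

# Hypothetical per-example bias labels (1 = biased pair) from dataset creation.
is_biased = torch.load("dataset/length_dataset/train_bias_labels.pt")

fpr, tpr, _ = roc_curve(is_biased.numpy(), influence.numpy())
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.3f}")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```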