Description
In TRL==0.11.0, we could use the multi-adapter setup to train a PPO model like this:
- $\pi_\text{sft}$ (the SFT model) as the base model
- $\pi_\text{sft} + \text{LoRA}_\text{rm}$ as the reward model
- $\pi_\text{sft} + \text{LoRA}_\text{policy}$ as the policy model
- $\pi_\text{sft} + \text{LoRA}_\text{critic}$ as the value model
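
For reference, a minimal sketch of the 0.11.0-style setup, following the old multi-adapter RL docs (the model and adapter names here are placeholders):

```python
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead

# Placeholder identifiers -- substitute your own SFT model and reward adapter.
base_model_name = "my-org/sft-model"
rm_adapter_id = "my-org/rm-lora-adapter"

# LoRA config for the policy adapter that PPO will train.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# A single base model carries both the policy LoRA (via peft_config) and the
# reward LoRA (via reward_adapter); the value head serves the critic role.
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    base_model_name,
    peft_config=lora_config,
    reward_adapter=rm_adapter_id,
)

# During PPO, rewards come from the same base model with the RM adapter active:
# rewards = model.compute_reward_score(input_ids, attention_mask)
```
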
How can I run the same multi-adapter PPO training in v0.16.0?