Unified implementations and rigorous evaluation for offline reinforcement learning - built by Matthew Jackson, Uljad Berdica, and Jarek Liesen.
- ⚛️ Single-file: We implement algorithms as standalone Python files.
- 🤏 Minimal: We only edit what is necessary between algorithms, making comparisons straightforward.
- ⚡️ GPU-accelerated: We use JAX and end-to-end compile all training code, enabling lightning-fast training.
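To make the last point concrete, here is a minimal sketch (not code from this repo; all names are hypothetical) of the usual JAX pattern for end-to-end compilation: the whole training loop is jit-compiled, with the per-step update rolled into a `jax.lax.scan`.

```python
# Minimal illustration of end-to-end compilation in JAX (hypothetical toy example).
import jax
import jax.numpy as jnp


def update_step(params, _):
    # Hypothetical update: one gradient step on a toy quadratic loss.
    grads = jax.grad(lambda p: jnp.sum(p**2))(params)
    return params - 1e-2 * grads, None


@jax.jit
def train(params):
    # Roll the entire training loop into a single compiled computation.
    params, _ = jax.lax.scan(update_step, params, None, length=1_000)
    return params


final_params = train(jnp.ones(4))
```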
Inspired by CORL and CleanRL - check them out!
We provide two types of algorithm implementation:
- Standalone: Each algorithm is implemented as a single file with minimal dependencies, making it easy to understand and modify.
- Unified: Most algorithms are available as configs for our unified implementation, `unifloral.py`.
After training, final evaluation results are saved to `.npz` files in `final_returns/` for analysis using our evaluation protocol.
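For example, a quick way to inspect the saved files (a sketch; the array names stored in each `.npz` archive are not assumed here):

```python
# Hedged sketch: list the arrays stored in each saved results file.
import glob
import numpy as np

for path in sorted(glob.glob("final_returns/*.npz")):
    with np.load(path) as data:
        print(path, {name: data[name].shape for name in data.files})
```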
All scripts support D4RL and use Weights & Biases for logging, with configs provided as WandB sweep files.
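As an illustration, one way to register a provided sweep config with Weights & Biases from Python (a sketch; the project name is an assumption, and sweeps can equally be launched with the standard `wandb sweep` / `wandb agent` CLI):

```python
# Hedged sketch: register one of the provided sweep files with W&B.
import yaml
import wandb

with open("unifloral/iql.yaml") as f:
    sweep_config = yaml.safe_load(f)

sweep_id = wandb.sweep(sweep=sweep_config, project="unifloral")  # project name is an assumption
print(f"Launch agents with: wandb agent {sweep_id}")
```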
Algorithm | Standalone | Unified | Extras |
---|---|---|---|
BC | `bc.py` | `unifloral/bc.yaml` | - |
SAC-N | `sac_n.py` | `unifloral/sac_n.yaml` | [ArXiv] |
EDAC | `edac.py` | `unifloral/edac.yaml` | [ArXiv] |
CQL | `cql.py` | - | [ArXiv] |
IQL | `iql.py` | `unifloral/iql.yaml` | [ArXiv] |
TD3-BC | `td3_bc.py` | `unifloral/td3_bc.yaml` | [ArXiv] |
ReBRAC | `rebrac.py` | `unifloral/rebrac.yaml` | [ArXiv] |
TD3-AWR | - | `unifloral/td3_awr.yaml` | [ArXiv] |
We implement a single script for dynamics model training: `dynamics.py`, with config `dynamics.yaml`.
Algorithm | Standalone | Unified | Extras |
---|---|---|---|
MOPO | `mopo.py` | - | [ArXiv] |
MOReL | `morel.py` | - | [ArXiv] |
COMBO | `combo.py` | - | [ArXiv] |
MoBRAC | - | `unifloral/mobrac.yaml` | [ArXiv] |
New ones coming soon 👀
Our evaluation script (`evaluation.py`) implements the protocol described in our paper, analysing the performance of a UCB bandit over a range of policy evaluations.
```python
from evaluation import load_results_dataframe, bootstrap_bandit_trials
import jax.numpy as jnp

# Load all results from the final_returns directory
df = load_results_dataframe("final_returns")

# Run bandit trials with bootstrapped confidence intervals.
# `policy_returns` is an array of per-rollout returns for each policy,
# e.g. assembled from the loaded results dataframe.
results = bootstrap_bandit_trials(
    returns_array=jnp.array(policy_returns),  # Shape: (num_policies, num_rollouts)
    num_subsample=8,  # Number of policies to subsample
    num_repeats=1000,  # Number of bandit trials
    max_pulls=200,  # Maximum pulls per trial
    ucb_alpha=2.0,  # UCB exploration coefficient
    n_bootstraps=1000,  # Bootstrap samples for confidence intervals
    confidence=0.95,  # Confidence level
)

# Access results
pulls = results["pulls"]  # Number of pulls at each step
means = results["estimated_bests_mean"]  # Mean score of estimated best policy
ci_low = results["estimated_bests_ci_low"]  # Lower confidence bound
ci_high = results["estimated_bests_ci_high"]  # Upper confidence bound
```
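A possible follow-up (not part of the repo) is to plot how the score of the estimated best policy improves with the number of evaluation rollouts, using the arrays returned above:

```python
# Hedged sketch: visualise the bandit results computed above.
import matplotlib.pyplot as plt

plt.plot(pulls, means, label="Estimated best policy")
plt.fill_between(pulls, ci_low, ci_high, alpha=0.3, label="95% confidence interval")
plt.xlabel("Total policy evaluations (pulls)")
plt.ylabel("Mean return of estimated best policy")
plt.legend()
plt.show()
```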
```bibtex
@misc{jackson2025clean,
  title={A Clean Slate for Offline Reinforcement Learning},
  author={Matthew Thomas Jackson and Uljad Berdica and Jarek Liesen and Shimon Whiteson and Jakob Nicolaus Foerster},
  year={2025},
  eprint={2504.11453},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2504.11453},
}
```