🌹 Unifloral: Unified Offline Reinforcement Learning

Unified implementations and rigorous evaluation for offline reinforcement learning - built by Matthew Jackson, Uljad Berdica, and Jarek Liesen.

💡 Code Philosophy

  • ⚛️ Single-file: We implement algorithms as standalone Python files.
  • 🤏 Minimal: We only edit what is necessary between algorithms, making comparisons straightforward.
  • ⚡️ GPU-accelerated: We use JAX and end-to-end compile all training code, enabling lightning-fast training (see the sketch below).

Inspired by CORL and CleanRL - check them out!
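
To illustrate what end-to-end compilation means in practice, here is a minimal, hypothetical sketch of a JAX training loop compiled as a single function (the update rule and names are illustrative, not unifloral's actual code):

from functools import partial
import jax
import jax.numpy as jnp

def update_step(params, _):
    # One dummy gradient step; a real agent would compute a TD or BC loss here.
    grads = jax.grad(lambda p: jnp.sum(p ** 2))(params)
    return params - 1e-3 * grads, None

@partial(jax.jit, static_argnames="num_steps")  # compile the whole training loop once
def train(params, num_steps):
    params, _ = jax.lax.scan(update_step, params, None, length=num_steps)
    return params

final_params = train(jnp.ones(8), num_steps=1_000)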

🤖 Algorithms

We provide two types of algorithm implementation:

  1. Standalone: Each algorithm is implemented as a single file with minimal dependencies, making it easy to understand and modify.
  2. Unified: Most algorithms are available as configs for our unified implementation unifloral.py.

After training, final evaluation results are saved to .npz files in final_returns/ for analysis using our evaluation protocol.

All scripts support D4RL and use Weights & Biases for logging, with configs provided as WandB sweep files.
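
For reference, one of these sweep files can be launched with the W&B Python API roughly as follows (a sketch assuming the standard sweep-file format; the project name is an assumption, not taken from the repository):

import yaml
import wandb

# Load a provided sweep config and register it as a W&B sweep.
with open("unifloral/iql.yaml") as f:
    sweep_config = yaml.safe_load(f)

sweep_id = wandb.sweep(sweep_config, project="unifloral")  # hypothetical project name
wandb.agent(sweep_id, count=1)  # runs the training script referenced by the sweep config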

Model-free

| Algorithm | Standalone | Unified               | Extras  |
|-----------|------------|-----------------------|---------|
| BC        | bc.py      | unifloral/bc.yaml     | -       |
| SAC-N     | sac_n.py   | unifloral/sac_n.yaml  | [ArXiv] |
| EDAC      | edac.py    | unifloral/edac.yaml   | [ArXiv] |
| CQL       | cql.py     | -                     | [ArXiv] |
| IQL       | iql.py     | unifloral/iql.yaml    | [ArXiv] |
| TD3-BC    | td3_bc.py  | unifloral/td3_bc.yaml | [ArXiv] |
| ReBRAC    | rebrac.py  | unifloral/rebrac.yaml | [ArXiv] |
| TD3-AWR   | -          | unifloral/td3_awr.yaml | [ArXiv] |

Model-based

We implement a single script for dynamics model training: dynamics.py, with config dynamics.yaml.

| Algorithm | Standalone | Unified               | Extras  |
|-----------|------------|-----------------------|---------|
| MOPO      | mopo.py    | -                     | [ArXiv] |
| MOReL     | morel.py   | -                     | [ArXiv] |
| COMBO     | combo.py   | -                     | [ArXiv] |
| MoBRAC    | -          | unifloral/mobrac.yaml | [ArXiv] |

New ones coming soon 👀

📊 Evaluation

Our evaluation script (evaluation.py) implements the protocol described in our paper, analysing the performance of a UCB bandit over a range of policy evaluations.
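
For intuition, a UCB bandit repeatedly re-evaluates (pulls) the policy with the highest upper confidence bound on its return. With a standard UCB1-style rule (one plausible reading of the ucb_alpha parameter below, not a quote from the paper), that bound is

$$\hat{\mu}_i + \alpha \sqrt{\frac{\ln t}{n_i}},$$

where $\hat{\mu}_i$ is policy $i$'s mean return so far, $n_i$ its number of rollouts, and $t$ the total number of pulls.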

from evaluation import load_results_dataframe, bootstrap_bandit_trials
import jax.numpy as jnp

# Load all results from the final_returns directory
df = load_results_dataframe("final_returns")

# Run bandit trials with bootstrapped confidence intervals
# (policy_returns: per-policy rollout returns, e.g. assembled from df)
results = bootstrap_bandit_trials(
    returns_array=jnp.array(policy_returns),  # Shape: (num_policies, num_rollouts)
    num_subsample=8,     # Number of policies to subsample
    num_repeats=1000,    # Number of bandit trials
    max_pulls=200,       # Maximum pulls per trial
    ucb_alpha=2.0,       # UCB exploration coefficient
    n_bootstraps=1000,   # Bootstrap samples for confidence intervals
    confidence=0.95      # Confidence level
)

# Access results
pulls = results["pulls"]                      # Number of pulls at each step
means = results["estimated_bests_mean"]       # Mean score of estimated best policy
ci_low = results["estimated_bests_ci_low"]    # Lower confidence bound
ci_high = results["estimated_bests_ci_high"]  # Upper confidence bound
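
These arrays plot directly against the evaluation budget; for example, a minimal matplotlib sketch (not part of the repository):

import matplotlib.pyplot as plt

# Score of the estimated best policy vs. number of policy rollouts used.
plt.plot(pulls, means, label="Estimated best policy")
plt.fill_between(pulls, ci_low, ci_high, alpha=0.3, label="95% CI")
plt.xlabel("Total policy rollouts (pulls)")
plt.ylabel("Return")
plt.legend()
plt.show()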

📝 Cite us!

@misc{jackson2025clean,
      title={A Clean Slate for Offline Reinforcement Learning},
      author={Matthew Thomas Jackson and Uljad Berdica and Jarek Liesen and Shimon Whiteson and Jakob Nicolaus Foerster},
      year={2025},
      eprint={2504.11453},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.11453},
}