CQL: Conservative Q-Learning for Offline Reinforcement Learning
Is Pessimism Provably Efficient for Offline RL?
The Importance of Pessimism in Fixed-Dataset Policy Optimization
Off-Policy Deep Reinforcement Learning without Exploration, BCQ, ICML 2019.
Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction, BEAR; includes an analysis of out-of-distribution (OOD) actions in Q-learning.
Behavior Regularized Offline Reinforcement Learning, BRAC, a framework generalizing BEAR, BCQ, etc. (a sketch of the shared behavior-regularization idea follows below).
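BCQ, BEAR, and BRAC share one core idea: maximize the learned Q-value while keeping the policy close to the data-generating behavior policy. A minimal PyTorch sketch of that shared objective; the `dist(states)` interface and network names here are illustrative assumptions, not any of these papers' reference code:

```python
import torch

def behavior_regularized_actor_loss(policy, q_net, behavior_policy, states, alpha=0.1):
    """BRAC-style actor loss (sketch): maximize Q while penalizing
    divergence from a behavior policy cloned from the dataset."""
    pi = policy.dist(states)                          # current policy pi(a|s)
    beta = behavior_policy.dist(states)               # cloned behavior policy beta(a|s)
    actions = pi.rsample()                            # reparameterized sample, keeps gradients
    q_values = q_net(states, actions).squeeze(-1)     # critic estimate Q(s, a)
    kl = torch.distributions.kl_divergence(pi, beta)  # per-state KL(pi || beta)
    return (-q_values + alpha * kl).mean()            # minimize: -Q + alpha * KL
```

BEAR replaces the KL term with an MMD constraint, and BCQ instead restricts candidate actions through a generative model; the penalized form above is closest to BRAC.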
Critic Regularized Regression, CRR, NeurIPS 2020.
Q-Value Weighted Regression: Reinforcement Learning with Limited Data, an extension of [KEEP DOING WHAT WORKED ...]; unaccepted.
MOReL: Model-Based Offline Reinforcement Learning
MOPO: Model-based Offline Policy Optimization
COMBO: Conservative Offline Model-Based Policy Optimization
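The common thread in MOReL, MOPO, and COMBO is pessimism in a learned model. Roughly, MOPO optimizes the policy on model rollouts with an uncertainty-penalized reward

$$ \tilde r(s,a) = \hat r(s,a) - \lambda\, u(s,a), $$

where $u(s,a)$ estimates the model error (in practice, e.g., disagreement or predicted variance across a dynamics ensemble) and $\lambda$ trades return against staying where the model is reliable; MOReL instead builds a pessimistic MDP that terminates with low reward where the ensemble disagrees.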
Model-Based Offline Planning, ICLR 2021.
Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization, NeurIPS 2020.
Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization, ICLR 2021
OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning, offline RL + hierarchical RL (HRL).
Multi-task Batch Reinforcement Learning with Metric Learning, NeurIPS 2020, multi-task; generalizes to unseen tasks.
Batch Reinforcement Learning Through Continuation Method, ICLR 2020, offline RL + continuation method.
COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning, CoRL 2020.
Keep Doing What Worked: Behavior Modelling Priors for Offline Reinforcement Learning, ICLR 2020.
Offline Reinforcement Learning
d4rl_evaluations, TensorFlow
AWAC, CQL, MOPO, PyTorch
polixir, PyTorch
Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning, 2019, a survey of OPE methods.
Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization, ICLR 2021.
Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies, NeurIPS 2020.
Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes, [code].
Statistical Bootstrapping for Uncertainty Estimation in Off-Policy Evaluation, [code].
Batch Policy Learning under Constraints, introduces FQE (fitted Q evaluation; minimal sketch below).
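FQE evaluates a fixed target policy by regressing a Q-network onto that policy's own Bellman targets over the offline dataset. A minimal PyTorch sketch; the batch layout and network names are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def fqe_update(q_net, target_q_net, target_policy, batch, optimizer, gamma=0.99):
    """One fitted Q-evaluation (FQE) step for a fixed target policy."""
    s, a, r, s_next, done = batch                     # offline transitions
    with torch.no_grad():
        a_next = target_policy(s_next)                # action the evaluated policy would take
        q_next = target_q_net(s_next, a_next).squeeze(-1)
        target = r + gamma * (1.0 - done) * q_next    # Bellman target under the target policy
    loss = F.mse_loss(q_net(s, a).squeeze(-1), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The value estimate is then the average of $Q(s_0, \pi(s_0))$ over initial states, with `target_q_net` periodically copied from `q_net`.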
Doubly Robust Off-policy Value Evaluation for Reinforcement Learning, the doubly robust (DR) estimator (recursion below).
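For reference, the step-wise DR recursion from that paper, run backward from the end of each length-$H$ trajectory with $V_{\mathrm{DR}}^{H} = 0$:

$$ V_{\mathrm{DR}}^{t} = \hat V(s_t) + \rho_t \left( r_t + \gamma V_{\mathrm{DR}}^{t+1} - \hat Q(s_t, a_t) \right), \qquad \rho_t = \frac{\pi(a_t \mid s_t)}{\mu(a_t \mid s_t)}. $$

The estimator is unbiased if either the importance ratios $\rho_t$ or the model $(\hat Q, \hat V)$ is correct, and the model terms act as control variates that cut the variance of pure importance sampling.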
DualDICE, from google-research.
GitHub code: implementations of importance sampling, direct, and hybrid methods for off-policy evaluation (minimal IS sketch below).
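For concreteness, a minimal NumPy sketch of the simplest of those families, per-trajectory (ordinary) importance sampling; the `pi`/`mu` probability callables and trajectory layout are illustrative assumptions, not that repository's API:

```python
import numpy as np

def trajectory_is_estimate(trajectories, pi, mu, gamma=0.99):
    """Ordinary importance-sampling OPE: weight each logged trajectory's
    return by the product of per-step probability ratios pi/mu."""
    estimates = []
    for traj in trajectories:                # traj: list of (state, action, reward)
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi(a, s) / mu(a, s)    # cumulative ratio rho_{0:t}
            ret += (gamma ** t) * r          # discounted return of the logged trajectory
        estimates.append(weight * ret)       # trajectory-level IS estimate
    return float(np.mean(estimates))
```

Per-decision IS weights each reward only by the ratio up to its own time step, and weighted (self-normalized) IS divides by the sum of weights; both variants trade a little bias for much lower variance.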
Code for Empirical Study of Off-Policy Policy Evaluation, OPE tools.
Benchmarks for Deep Off-Policy Evaluation, [code], ICLR 2021. This release provides: 1) policies for the tasks in the D4RL, DeepMind Locomotion, and Control Suite datasets; 2) policies trained with D4PG, ABM, CRR, SAC, DAPG, and BC, plus snapshots along the training trajectory. This facilitates benchmarking offline model selection. [auxiliary code], [auxiliary code dice]
GAIL: Generative Adversarial Imitation Learning
Transfer from Simulation to Real World through Learning Deep Inverse Dynamics Model, CoRR, 2016. Proposes combining an inverse dynamics policy learned from expert demonstrations with a policy trained in simulation.
Offline Imitation Learning with a Misspecified Simulator, NeurIPS 2020.
Virtual-Taobao: Virtualizing Real-world Online Retail Environment for Reinforcement Learning
Error Bounds of Imitating Policies and Environments. Analyzes the value gap between the expert policy and the imitated policy for two imitation methods, behavioral cloning and generative adversarial imitation; the results show that generative adversarial imitation reduces compounding errors relative to behavioral cloning, and therefore has better sample complexity (bounds below).
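The headline comparison, stated informally (per-step imitation error $\epsilon$, discount $\gamma$, so $1/(1-\gamma)$ plays the role of the horizon):

$$ \left| V(\pi_E) - V(\pi_{\mathrm{BC}}) \right| = O\!\left( \frac{\epsilon}{(1-\gamma)^2} \right), \qquad \left| V(\pi_E) - V(\pi_{\mathrm{GAIL}}) \right| = O\!\left( \frac{\epsilon}{1-\gamma} \right), $$

i.e., behavioral cloning's gap compounds quadratically in the effective horizon while adversarial imitation's grows only linearly.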
Self-Supervised Policy Adaptation During Deployment
Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning, Sutton et al., 1999, the options framework (definition below).
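In that framework an option is a triple $o = (I_o, \pi_o, \beta_o)$: an initiation set $I_o \subseteq S$, an intra-option policy $\pi_o$, and a termination condition $\beta_o : S \to [0,1]$. Acting over options yields a semi-MDP with option-value Bellman equation

$$ Q(s, o) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} + \gamma^{k} \max_{o'} Q(s_{t+k}, o') \right], $$

where $k$ is the random number of steps option $o$ runs before terminating.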
Multi-Objective Reinforcement Learning using Sets of Pareto Dominating Policies, JMLR 2014, Pareto-optimal policies in RL.
Bridging Theory and Algorithm for Domain Adaptation, domain transfer, MMD (maximum mean discrepancy).