papers

Offline RL

Model-free

OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning

CQL: Conservative Q-Learning for offline reinforcement learning
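A minimal sketch of the conservative penalty CQL adds on top of the standard Bellman error, assuming a discrete action space and a Q-network that outputs one value per action; names like q_net and the batch layout are illustrative placeholders, not the authors' code:

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, batch, alpha=1.0, gamma=0.99):
    """Bellman error plus the CQL(H) conservative penalty (discrete actions)."""
    s, a, r, s_next, done = batch  # tensors sampled from the offline dataset

    # Standard TD target from a target network.
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_q_net(s_next).max(dim=1).values

    q_all = q_net(s)                                      # Q(s, .) for every action
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a) for dataset actions

    bellman_error = F.mse_loss(q_data, target)

    # Conservative term: push all action values down (logsumexp),
    # push the values of actions actually present in the data back up.
    conservative_penalty = (torch.logsumexp(q_all, dim=1) - q_data).mean()

    return bellman_error + alpha * conservative_penalty
```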

Is Pessimism Provably Efficient for Offline RL?

THE IMPORTANCE OF PESSIMISM IN FIXED-DATASET POLICY OPTIMIZATION

Off-Policy Deep Reinforcement Learning without Exploration, BCQ, ICML 2019.

Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction, BEAR; includes an analysis of OOD actions in Q-learning.

Behavior Regularized Offline Reinforcement Learning, BRAC, generalizing BEAR, BCQ, etc.
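A rough sketch of the behavior-regularized actor objective that BRAC-style methods optimize: maximize Q while penalizing divergence from an estimated behavior policy. The KL penalty, beta, and the policy/critic interfaces below are assumptions for illustration; BEAR and BCQ correspond to other divergence or constraint choices in the same template.

```python
import torch

def brac_actor_loss(q_net, policy, behavior_policy, states, beta=1.0):
    """Behavior-regularized actor objective: maximize Q while staying close
    to an estimated behavior policy (here via a KL penalty)."""
    dist = policy(states)                    # a torch.distributions object
    actions = dist.rsample()                 # reparameterized sample for gradients

    q_value = q_net(states, actions).squeeze(-1)
    behavior_dist = behavior_policy(states)  # e.g. a policy fit by behavior cloning
    kl = torch.distributions.kl_divergence(dist, behavior_dist)

    # Minimize: -Q + beta * divergence; MMD (BEAR) or a perturbation
    # constraint (BCQ) slot into the same framework.
    return (-q_value + beta * kl).mean()
```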

Critic Regularized Regression, CRR, NeurIPS 2020.

Q-Value Weighted Regression: Reinforcement Learning with Limited Data. An extension of [KEEP DOING WHAT WORKED ...]; unaccepted.

Batch Reinforcement Learning Through Continuation Method

Model-based

MOReL: Model-Based Offline Reinforcement Learning

MOPO: Model-based Offline Policy Optimization
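A small sketch of the core MOPO idea: penalize the learned model's reward by an uncertainty estimate when generating short synthetic rollouts. The ensemble interface and the disagreement-based uncertainty proxy below are assumptions for illustration; the paper itself uses the maximum predicted standard deviation across the ensemble.

```python
import torch

def pessimistic_reward(model_ensemble, state, action, lam=1.0):
    """MOPO-style reward: learned reward minus an uncertainty penalty,
    applied to transitions generated by the learned dynamics model."""
    # Assumed interface: each ensemble member maps (state, action) to
    # (next_state_mean, next_state_std, reward).
    preds = [m(state, action) for m in model_ensemble]
    rewards = torch.stack([p[2] for p in preds])        # [ensemble, batch]
    next_means = torch.stack([p[0] for p in preds])     # [ensemble, batch, obs_dim]

    # Uncertainty proxy: disagreement between ensemble mean predictions
    # (a simplification of the paper's max predicted std norm).
    uncertainty = next_means.std(dim=0).norm(dim=-1)    # [batch]

    return rewards.mean(dim=0) - lam * uncertainty
```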

COMBO: Conservative Offline Model-Based Policy Optimization

MODEL-BASED OFFLINE PLANNING, ICLR 2021, 8755

NeurIPS 2020 Offline RL Workshop

Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization, NeurIPS 2020

Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization, ICLR 2021

Applications

OPAL: OFFLINE PRIMITIVE DISCOVERY FOR ACCELERATING OFFLINE REINFORCEMENT LEARNING, offline RL + HRL.

Multi-task Batch Reinforcement Learning with Metric Learning, NeurIPS 2020, multi-task, generalizes to unseen tasks.

Batch Reinforcement Learning Through Continuation Method, ICLR 2020, offline RL + continuation method.

COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning, CoRL 2020.

KEEP DOING WHAT WORKED: BEHAVIOR MODELLING PRIORS FOR OFFLINE REINFORCEMENT LEARNING ICLR 2020.

datasets

D4RL
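A minimal example of pulling an offline dataset with the d4rl package (assumes d4rl and MuJoCo are installed; the environment name is just one of the standard benchmarks):

```python
import gym
import d4rl  # noqa: F401 -- importing registers the offline environments with gym

env = gym.make('halfcheetah-medium-v2')

# Raw dataset: a dict of numpy arrays (observations, actions, rewards, terminals, ...).
dataset = env.get_dataset()
print(dataset['observations'].shape)

# Convenience view with next_observations added, ready for Q-learning style methods.
qdataset = d4rl.qlearning_dataset(env)
```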

doc

Offline Reinforcement Learning

NeurIPS 2020 Offline RL Workshop

Video tutorial

code

d4rl_evaluations, tensorflow

AWAC, CQL, MOPO, pytorch

polixir, pytorch

OPE

Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning, 2019, survey of OPE.
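For reference, the simplest estimator covered by such surveys is trajectory-wise importance sampling; a small sketch, where pi_e and pi_b are assumed to be callables returning action probabilities under the evaluation and behavior policies:

```python
import numpy as np

def per_trajectory_is(trajectories, pi_e, pi_b, gamma=0.99):
    """Plain (unweighted) trajectory-wise importance-sampling OPE estimate."""
    estimates = []
    for traj in trajectories:                  # traj: list of (state, action, reward)
        rho, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_e(a, s) / pi_b(a, s)     # cumulative importance weight
            ret += (gamma ** t) * r            # discounted return
        estimates.append(rho * ret)
    return float(np.mean(estimates))
```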

AUTOREGRESSIVE DYNAMICS MODELS FOR OFFLINE POLICY EVALUATION AND OPTIMIZATION, ICLR 2021.

Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies, NeurIPS 2020.

Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes, code

Statistical Bootstrapping for Uncertainty Estimation in Off-Policy Evaluation, code

Batch Policy Learning under Constraints, FQE
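Fitted Q-evaluation (FQE) regresses a Q-function toward the Bellman target of the fixed evaluation policy using only the logged transitions; a minimal sketch, with q_net, target_q_net, policy, and the batch layout as assumed placeholders:

```python
import torch
import torch.nn.functional as F

def fqe_update(q_net, target_q_net, policy, batch, gamma=0.99):
    """One fitted Q-evaluation regression step for a fixed evaluation policy."""
    s, a, r, s_next, done = batch  # transitions from the behavior dataset

    with torch.no_grad():
        a_next = policy(s_next)    # the evaluation policy's action at the next state
        target = r + gamma * (1 - done) * target_q_net(s_next, a_next).squeeze(-1)

    q = q_net(s, a).squeeze(-1)
    return F.mse_loss(q, target)

# After training to convergence, the policy value is estimated as the average of
# Q(s0, policy(s0)) over the dataset's initial states.
```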

Doubly Robust Off-policy Value Evaluation for Reinforcement Learning, doubly robust (DR).
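A sketch of the per-trajectory doubly robust recursion, which combines a learned value model with per-step importance weights; q_hat, v_hat, and the policy callables are assumptions for illustration:

```python
def doubly_robust_estimate(trajectory, q_hat, v_hat, pi_e, pi_b, gamma=0.99):
    """Per-trajectory doubly robust OPE estimate via backward recursion:
    DR_t = V_hat(s_t) + rho_t * (r_t + gamma * DR_{t+1} - Q_hat(s_t, a_t))."""
    dr = 0.0
    for s, a, r in reversed(trajectory):       # trajectory: list of (state, action, reward)
        rho = pi_e(a, s) / pi_b(a, s)          # per-step importance weight
        dr = v_hat(s) + rho * (r + gamma * dr - q_hat(s, a))
    return dr
```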

code

DualDICE, from google-research.

github-code, implementation of importance sampling, direct, and hybrid methods for off-policy evaluation.

code - Empirical Study of Off-Policy Policy Evaluation, OPE tools.

BENCHMARKS FOR DEEP OFF-POLICY EVALUATION, [code], ICLR 2021. This release provides: 1) policies for the tasks in the D4RL, DeepMind Locomotion and Control Suite datasets; 2) policies trained with the following algorithms (D4PG, ABM, CRR, SAC, DAPG and BC) and snapshots along the training trajectory. This facilitates benchmarking offline model selection. [auxiliary code], [auxiliary code dice]

IRL

GAIL- Generative Adversarial Imitation Learning
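A brief sketch of the GAIL discriminator update, which classifies expert state-action pairs against the learner's; the discriminator interface and batch layout are assumptions, and the policy side would be trained with an on-policy RL step (TRPO in the paper) on the surrogate reward:

```python
import torch
import torch.nn.functional as F

def gail_discriminator_loss(discriminator, expert_batch, policy_batch):
    """GAIL discriminator update: separate expert (s, a) pairs from policy pairs."""
    expert_logits = discriminator(*expert_batch)   # (states, actions) from the expert
    policy_logits = discriminator(*policy_batch)   # (states, actions) from the learner

    expert_loss = F.binary_cross_entropy_with_logits(
        expert_logits, torch.ones_like(expert_logits))
    policy_loss = F.binary_cross_entropy_with_logits(
        policy_logits, torch.zeros_like(policy_logits))
    return expert_loss + policy_loss

# The learner's surrogate reward is then, e.g., -log(1 - D(s, a)) or log D(s, a).
```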

Sim2Real

Transfer from Simulation to Real World through Learning Deep Inverse Dynamics Model, CoRR, 2016. Proposes a method that combines an inverse dynamics policy learned from expert demonstrations with a policy trained in the simulator.

Offline Imitation Learning with a Misspecified Simulator, NeurIPS 2020

Virtual-Taobao: Virtualizing Real-world Online Retail Environment for Reinforcement Learning

IL

Error Bounds of Imitating Policies and Environments. The paper analyzes the value gap between the expert policy and imitated policies under two imitation methods, behavioral cloning and generative adversarial imitation. The results show that generative adversarial imitation can reduce compounding errors compared to behavioral cloning, and thus has better sample complexity.

RL-self-supervised

SELF-SUPERVISED POLICY ADAPTATION DURING DEPLOYMENT

HRL

Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning, Sutton et al., 1999, the options framework.

ML

Pareto

Pareto Multi-Task Learning

Multi-Objective Reinforcement Learning using Sets of Pareto Dominating Policies, JMLR 2014, Pareto in RL.

Bridging Theory and Algorithm for Domain Adaptation, domain transfer, MMD.
