Famous successes with AlphaZero have come in games such as Chess and Go, where there is a clean reward signal and the state space is discrete. These results don't transfer well to real-world control environments, which have continuous action spaces and noise.
Can AlphaZero-like approaches learn optimal low-level controls?
This repo implements an AlphaZero agent for continuous control in ~250 lines of code.
Training takes ~6 min on my laptop running parallelized MCTS over 3M simulation steps.
Average rollout reward is ~-7.11 for vanilla MCTS vs. ~-3.741 with A0C (guiding the search with the actor and value networks helps!).
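For context, standard AlphaZero search assumes a discrete, enumerable action set; A0C-style approaches typically handle continuous actions by sampling candidate actions from the actor network and growing each node's action set via progressive widening. The snippet below is a minimal illustrative sketch of that idea, not the repo's actual code; the constants and the `actor(state)` callable are assumptions.

```python
# Minimal sketch of MCTS action selection with progressive widening for a
# continuous action space. All names and constants here are illustrative.
import math

C_PW, ALPHA_PW, C_UCT = 1.0, 0.5, 1.5  # assumed progressive-widening / UCT constants

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}   # action tuple -> child Node (None until expanded)
        self.N = {}          # per-action visit counts
        self.Q = {}          # per-action mean returns
        self.total_visits = 0

    def select_action(self, actor):
        # Progressive widening: only add a newly sampled action while the number
        # of children is below C_PW * total_visits ** ALPHA_PW.
        if len(self.children) < C_PW * max(1, self.total_visits) ** ALPHA_PW:
            a = tuple(actor(self.state))  # sample a continuous action from the actor net
            self.children.setdefault(a, None)
            self.N.setdefault(a, 0)
            self.Q.setdefault(a, 0.0)
            return a
        # Otherwise choose among existing actions with a UCT score.
        return max(
            self.children,
            key=lambda a: self.Q[a]
            + C_UCT * math.sqrt(math.log(self.total_visits + 1) / (self.N[a] + 1)),
        )

    def backup(self, action, value):
        # Running-mean update of Q for the chosen action.
        self.N[action] += 1
        self.total_visits += 1
        self.Q[action] += (value - self.Q[action]) / self.N[action]
```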
To train AlphaZero Continuous, evaluate the learned policy, and generate plots:
python train.py
To run the online MCTS planner (useful for debugging search):
python run_mcts.py
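As a mental model, an online planner of this kind re-plans from the current state at every environment step and executes only the first action. The loop below is a rough sketch under that assumption, not the actual contents of run_mcts.py; the Gymnasium env, the `plan_with_mcts` stub, and its signature are placeholders.

```python
# Rough sketch of a receding-horizon MCTS control loop (illustrative only).
import gymnasium as gym

def plan_with_mcts(env, obs, n_simulations=200):
    # Placeholder for the repo's search: a real planner would run
    # n_simulations MCTS rollouts from `obs` and return the best root action.
    return env.action_space.sample()

env = gym.make("Pendulum-v1")  # stand-in task, not necessarily the repo's env
obs, _ = env.reset()
episode_reward, done = 0.0, False
while not done:
    action = plan_with_mcts(env, obs)  # re-plan from the current state
    obs, reward, terminated, truncated, _ = env.step(action)
    episode_reward += reward
    done = terminated or truncated
print(f"episode reward: {episode_reward:.2f}")
```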
Planned next steps:
- increase lag
- keep track of state history and run on the controls challenge
- support general Gym environments (see the sketch below)
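On the last item: the main obstacle to supporting arbitrary Gym environments is that MCTS needs to branch the simulator from intermediate states. One simple (if slow) way to get that is to fork the environment before each simulated branch. The wrapper below is hypothetical and assumes a pure-Python env whose state survives deepcopy.

```python
# Hypothetical wrapper showing one way MCTS could fork a generic Gym env for
# lookahead: deep-copy the environment so simulated branches don't disturb it.
import copy
import gymnasium as gym

class ForkableEnv:
    def __init__(self, env):
        self.env = env

    def fork(self):
        # Only valid for envs whose full state survives copy.deepcopy
        # (true for many pure-Python envs, not for external simulators).
        return ForkableEnv(copy.deepcopy(self.env))

    def step(self, action):
        obs, reward, terminated, truncated, _ = self.env.step(action)
        return obs, reward, terminated or truncated

# Usage: sim = ForkableEnv(gym.make("Pendulum-v1")); branch = sim.fork()
```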