Naive random agent with planning
Naive agent based on the HTM framework described in the report. Key features:
- memorizes all transitions (r, s, a) -> (r', s', a') with a single [single- or multi-column] Temporal Memory
- (r, s, a) triplets are encoded into a single (s, a, r) SDR
- each part of the SDR is encoded with a naive integer encoder without overlapping buckets, then the parts are concatenated (see the encoder sketch after this list)
- can infer a policy [a1, a2, ..., aT] to the rewarding state if it lies within a radius of N memorized transitions (see the planner sketch after this list)
- the planning horizon N is a hyperparameter
- takes a random action if the planner fails to produce a plan
- with planning horizon = 0 it degrades to a random agent
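
A minimal sketch of the encoding step, assuming a Python/NumPy setting. The class and function names (`IntBucketEncoder`, `encode_sar`) and the bucket sizes are illustrative assumptions, not the actual implementation:

```python
import numpy as np

class IntBucketEncoder:
    """Naive integer encoder: each value activates its own disjoint block of bits."""
    def __init__(self, n_values: int, bucket_size: int):
        self.n_values = n_values
        self.bucket_size = bucket_size
        self.output_size = n_values * bucket_size

    def encode(self, value: int) -> np.ndarray:
        # No overlap between values: value i owns bits [i*bucket_size, (i+1)*bucket_size)
        sdr = np.zeros(self.output_size, dtype=np.int8)
        start = value * self.bucket_size
        sdr[start:start + self.bucket_size] = 1
        return sdr

def encode_sar(state_enc, action_enc, reward_enc, s, a, r):
    """Concatenate the per-field encodings into one (s, a, r) SDR."""
    return np.concatenate([
        state_enc.encode(s),
        action_enc.encode(a),
        reward_enc.encode(r),
    ])

# Example sizes (hypothetical): 25 states, 4 actions, binary reward, 10 bits per value
state_enc = IntBucketEncoder(25, 10)
action_enc = IntBucketEncoder(4, 10)
reward_enc = IntBucketEncoder(2, 10)
sdr = encode_sar(state_enc, action_enc, reward_enc, s=7, a=2, r=0)
```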
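A sketch of the planning and fallback logic. The real agent recovers transitions from Temporal Memory predictions; here a plain dictionary `transitions[(s, a)] = (s', r')` stands in for the memorized transitions, and the names `plan_to_reward`, `act`, and `goal_reward` are hypothetical:

```python
import random
from collections import deque

def plan_to_reward(transitions, start_state, horizon, actions, goal_reward=1):
    """Breadth-first search over memorized (s, a) -> (s', r') transitions,
    limited to the planning horizon; returns a list of actions or None."""
    queue = deque([(start_state, [])])
    visited = {start_state}
    while queue:
        state, plan = queue.popleft()
        if len(plan) >= horizon:
            continue  # do not search beyond the planning horizon
        for a in actions:
            if (state, a) not in transitions:
                continue  # this transition has not been memorized yet
            next_state, reward = transitions[(state, a)]
            if reward == goal_reward:
                return plan + [a]
            if next_state not in visited:
                visited.add(next_state)
                queue.append((next_state, plan + [a]))
    return None  # no plan found within the horizon

def act(transitions, state, horizon, actions):
    plan = plan_to_reward(transitions, state, horizon, actions)
    return plan[0] if plan else random.choice(actions)  # random fallback
```

With horizon = 0 the search returns immediately, so the agent always falls back to a random action, matching the degenerate case noted above.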
The agent was tested on three gridworld MDPs (multi_way_v0-2) with different planning horizons and compared against a random agent and a simple DQN.
Key results:
- learns (i.e., makes progress) faster than DQN
- even a planning horizon of 1 outperforms the random agent
- with a fixed planning horizon N, the advantage diminishes as environment complexity grows
- if the planning horizon N is large enough to plan to the reward from the initial state, the agent performs almost perfectly after a very small number of training episodes (roughly equal to the distance to the goal state)