Long-Term Visitation Value for Deep Exploration in Sparse-Reward Reinforcement Learning
Figure 1. A simple problem where naive exploration performs poorly. The agent starts in the second leftmost cell and is rewarded only for finding a treasure. It can move up/down/left/right, but if it moves up or down its position resets. Acting randomly at each step takes on average $4^N$ attempts to find the rightmost and most valuable reward, where $N$ is the number of cells to the right of the agent. As no reward is given in any intermediate cell, $\epsilon$-greedy exploration is likely to converge to the local optimum represented by the leftmost reward if $\epsilon$ decays too quickly.
Figure 2. Toy problem domain.
Figure 3. Toy example (part 1). Visitation count (a) and behavior policies (b–d) for the toy problem. Cells are divided into four triangles, each representing the count (a) or the behavior Q-value (b–d) of the actions “up/down/left/right” in that cell. Arrows denote the action with the highest Q-value. None of the behavior policies exhibits long-term deep exploration. In (1,3), both UCB1 (c) and the augmented Q-table (d) myopically select the action with the smallest immediate count. The latter also acts badly in (1,2) and (2,3), where it selects “left”, which has count one, instead of “up”, which has count zero.
Figure 4. Toy example (part 2). Visitation count (a) and behavior policies derived from the proposed W-functions (b,c). Due to the low count, the exploration upper bound dominates the greedy Q-value, and the proposed behavior policies successfully achieve long-term exploration. Both policies prioritize unexecuted actions and avoid terminal and prison states.
Figure 5. Chainworld with uniform count.
Figure 6. Toy example (part 3). Behavior policies under uniform visitation count. In the top row, the reward states (1,1) and (3,1) are terminal. In the bottom row, they are not, and the MDP has an infinite horizon. All state-action pairs have been executed once, except for “left” in (1,3). Only the proposed methods (d,e) correctly identify this state-action pair as the most interesting for exploration and propagate this information to other states thanks to TD learning. Once the action is executed, all policies converge to the greedy one (a).
Figure 7. The deep sea.
Figure 8. Results on the deep sea domain, averaged over 20 seeds. The proposed exploration achieves the best results, followed by bootstrapped exploration.
Figure 9. The taxi driver.
Figure 10. Results on the taxi domain, averaged over 20 seeds. The proposed algorithms outperform all others, being the only ones to solve the task without optimistic initialization.
Figure 11. The deep gridworld.
Figure 12. Results on the deep gridworld, averaged over 20 seeds. Once again, the proposed algorithms are the only ones to solve the task without optimistic initialization.
Figure 13. The “toy” gridworld.
Figure 14. The “prison” gridworld.
Figure 15. The “wall” gridworld.
Figure 16. Results on the toy gridworld, averaged over 20 seeds. Bootstrapped and bonus-based exploration perform well, but cannot match the proposed methods (the blue and orange lines overlap).
Figure 17. The “prison” and distractors affect the performance of all algorithms but ours.
Figure 18. The final gridworld emphasizes the better performance of our algorithms.
Figure 19. Visitation count at the end of learning. The seed is the same across all images. Initial states naturally have a higher count than other states. Recall that the upper portion of the deep sea and some gridworld cells cannot be visited. Only the proposed methods explore the environment uniformly (the figures show the count of the UCB-based W-function; the count-based W-function performed very similarly). The other algorithms myopically focus on distractors.
Figure 20. Performance of our algorithms when varying $\gamma_w$ on the “toy” gridworld (Figure 13). The higher $\gamma_w$, the more the agent explores, allowing it to discover the reward faster.
Figure 21. Visitation count for $W_{\mathrm{N}}^{\beta}$ at the end of learning, for the same random seed (out of 20). The higher the discount, the more uniform the count. The initial state (bottom left) naturally has a higher count because the agent always starts there. $W_{\mathrm{UCB}}^{\beta}$ performed very similarly.
Figure 22. Steps to learn for varying deep sea sizes $N = 5, 7, \ldots, 59$, averaged over ten runs with a limit of 500,000 steps. For filled dots, the run always converged, and we report the 95% confidence interval with error bars. For empty dots connected by dashed lines, the algorithm failed to learn within the step limit at least once; in this case, the average is over converged runs only and we do not report any confidence interval. With the exploration bonus (pink), learning barely ever converged (most of its dots are either missing or empty). With bootstrapping (red), all runs converged, but the results show a large confidence interval. With the random prior (green), learning rarely failed to converge and performed very similarly across random seeds, as shown by its extremely small confidence interval. By contrast, our method (blue, orange) is the only one that always converges, with a confidence interval close to zero; it outperforms all baselines by attaining the lowest sample complexity. Missing algorithms (random, $\epsilon$-greedy, UCB1, Thompson sampling) performed poorly and are not reported.
Figure 23. The ergodic chainworld.
Figure 24. Visitation counts for the chainworld with 27 states (x-axis) over time (y-axis). As time passes (top to bottom), the visitation count grows. Only our algorithms and UCB1 explore uniformly, producing a uniform colormap; however, UCB1 finds the last state (with the reward) much later. Once UCB1 finds the reward ((c), step 1000), the visitation count increases only in the last state, suggesting that when UCB1 finds the reward contained in the last state it effectively stops exploring other states. By contrast, our algorithms keep exploring the environment for longer. This allows the agent to learn the true Q-function for all states and explains why our algorithms achieve lower sample complexity and MSVE.
Figure 25. Sample complexity for varying chain sizes $N = 5, 7, \ldots, 59$. Error bars denote the 95% confidence interval. Only the proposed algorithms (blue and orange lines almost overlap) show little sensitivity to $\varepsilon$, as their sample complexity increases only slightly as $\varepsilon$ decreases. By contrast, the other algorithms are highly influenced by $\varepsilon$: their estimated complexity has a large confidence interval, and for smaller $\varepsilon$ they rarely learn an $\varepsilon$-optimal policy.
Figure 26. Sample complexity against $\varepsilon$, and MSVE over time, for 27- and 55-state chainworlds. Shaded areas denote the 95% confidence interval. As shown in Figure 25, the sample complexity of our approach is barely influenced by $\varepsilon$. Even for $\varepsilon = 0$, our algorithms attain low sample complexity, whereas the complexity of the other algorithms is several orders of magnitude higher. Similarly, only our algorithms learn an almost perfect Q-function, achieving an MSVE close to zero.
Figure 27. Results with a stochastic transition function on three of the previous MDPs. Once again, only our algorithms visit all states and learn the optimal policy within the step limit in all MDPs, and their performance is barely affected by the stochasticity.
Figure A1. Comparison of the proposed exploration against $\epsilon$-greedy and bootstrapped exploration, with and without replay memory. Each line denotes the average over 20 seeds. Only exploration based on the visitation value always learns in both domains, i.e., both with and without local optima (distractor rewards), regardless of the memory.
Abstract
1. Introduction
2. Preliminaries
2.1. Reinforcement Learning and Markov Decision Processes
2.2. Related Work
3. Long-Term Visitation Value for Deep Exploration
3.1. Example of the Limitations of Existing Immediate Count Methods
- It has executed all actions at least once in (1,3) and (2,2) (the prison cell),
- It has not yet executed “up” in (1,2),
- It has visited (1,1) once and experienced the reward of 1, and
- It has not yet executed “up” and “right” in (2,1).
- The greedy policy (Figure 3b) points to the only reward collected so far.
- The UCB1 policy (Figure 3c) yields Q-values of much larger magnitude. Most of the actions have been executed only a few times, so their upper confidence bound (UCB) is larger than their Q-value. The policy correctly selects unexecuted actions in (1,2) and (2,3), but fails in (1,3). There, it selects the least-executed actions (“up” and “left”), which, however, have already been executed once. The agent should know that these actions do not take it anywhere (it hits the environment boundary and does not move) and should avoid them. It selects them anyway because the policy is based on the immediate visitation count: only when the count is equal for all actions will the policy select a random action. This behavior is clearly myopic and extremely inefficient.
- The policy learned using the auxiliary bonus (Figure 3d) is even worse, as it acts badly not only in (1,3) but also in (1,2) and (2,3). There, it chooses “left”, which has already been selected once, instead of “up” (or “right” in (2,3)), which has not been selected yet. The reason for this behavior is that the auxiliary reward requires optimistic initialization [30]. The agent is positively rewarded for simply executing any action, with a bonus that depends on the visitation count. If the Q-function is initialized to zero, this positive feedback makes the agent believe that the action is good, which encourages it to execute the same action again and hinders exploration. This problem is typical of all methods using an auxiliary reward based on the visitation count, unless the Q-table is initialized optimistically [30]. More recent methods use more sophisticated rewards [16,17]; however, despite their strong theoretical convergence guarantees, for large problems these methods may require an impractical number of samples and are often outperformed by standard algorithms (a minimal sketch of both immediate-count rules follows below).
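To make these two immediate-count baselines concrete, the following is a minimal sketch (not the authors' code) of a UCB1-style behavior policy over a Q-table and of a count-based auxiliary bonus in the spirit of Strehl and Littman [30]. The constants `c` and `beta` and the exact bonus forms are illustrative assumptions.

```python
import numpy as np

def ucb1_action(Q, counts, s, c=2.0):
    # UCB1-style behavior policy on a Q-table: the bonus depends only on the
    # immediate visitation count n(s, a), so it cannot plan exploration ahead.
    n_s = counts[s].sum()
    bonus = c * np.sqrt(np.log(n_s + 1.0) / (counts[s] + 1e-8))
    return int(np.argmax(Q[s] + bonus))

def augmented_reward(r, counts, s, a, beta=0.1):
    # Count-based auxiliary bonus added to the environment reward at update time.
    # With a zero-initialized Q-table, this positive bonus can make an already
    # executed action look better than a never-executed one, which is the
    # failure mode discussed above.
    return r + beta / np.sqrt(counts[s, a] + 1.0)
```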
3.2. The Long-Term Visitation Value Approach
- In (1,3), both select “down” despite “up” and “left” having a lower count. The agent, in fact, knows that from (1,2) it can go “down” to (1,1), where it has executed only one action; thus, it knows that it can try new actions there. On the other hand, the only thing it knows about (2,3) is that it leads to the prison state, where it has already executed all actions. Thus, (1,2) has a higher long-term exploration value than (2,3).
- In (1,2), both select “up”, overturning the greedy action “down” (Figure 3b). This is crucial: to fully explore the environment, we need to select new actions.
- The same happens in (2,3), where actions with count zero are selected rather than the greedy ones.
- Neither policy leads the agent to the prison state, because it has already been visited and all actions have been executed there.
- Neither policy selects “right” in (2,2), because it has been executed one more time than the other actions. Nonetheless, the difference in action value is minimal, as shown by the almost uniform action colormap. (An illustrative sketch of such a W-function update follows below.)
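As a rough illustration of how such a long-term exploration value can be computed, the sketch below learns a visitation value W with its own TD backup and discount $\gamma_w$, and acts greedily with respect to the sum of Q and W. This is an illustrative sketch, not the authors' exact algorithm: the count-based exploration reward, the learning rate, and the way Q and W are combined are assumptions.

```python
import numpy as np

def w_update(W, counts, s, a, s_next, alpha=0.5, gamma_w=0.99):
    # TD backup of the long-term visitation value. The exploration reward is
    # large for rarely executed pairs and shrinks as n(s, a) grows; gamma_w
    # propagates it backwards, so states that lead to unexplored regions also
    # receive a high W-value.
    r_explore = 1.0 / np.sqrt(counts[s, a] + 1.0)
    td_target = r_explore + gamma_w * W[s_next].max()
    W[s, a] += alpha * (td_target - W[s, a])

def behavior_action(Q, W, s):
    # Behavior policy: greedy w.r.t. the sum of the Q-value and the visitation
    # value, so unexplored regions attract the agent even from distant states.
    return int(np.argmax(Q[s] + W[s]))
```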
3.3. W-Function Initialization and Bounds
- The augmented-reward policy (Figure 6b) has no interest in trying “left” in (1,3), and its value is the lowest. The agent, in fact, cannot obtain the visitation bonus without first executing the action. This shows once more that this passive way of exploring is highly inefficient.
- The UCB1 policy (Figure 6c) acts greedily everywhere except in (1,3), where it correctly selects the unexecuted action. However, since UCB1 considers only the immediate visitation count, it has no way of propagating this information to other state-action pairs.
- By contrast, for the proposed W-function policies (Figure 6d,e), the unexecuted state-action pair is the most interesting one, and from any state the policy brings the agent there. Thanks to the separate W-function, we can propagate the value of unexecuted actions to other states, always prioritizing the exploration of new state-action pairs. Under a uniform count, the agent then acts greedily w.r.t. the Q-function (a short argument for why is sketched below).
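Why a uniform count collapses the behavior policy to the greedy one can be seen with a short argument for the infinite-horizon case of the sketch above (again under the assumed count-based exploration reward, not necessarily the paper's exact derivation). If $n(s,a) = n$ for every pair, the exploration reward is a constant $c$, so the backup $W(s,a) = c + \gamma_w \max_{a'} W(s',a')$ has the constant fixed point $W(s,a) = c/(1-\gamma_w)$ for all $(s,a)$. Adding the same constant to every action leaves the argmax unchanged, hence $\arg\max_a \left[ Q(s,a) + W(s,a) \right] = \arg\max_a Q(s,a)$, i.e., the greedy policy.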
3.4. Final Remarks
4. Evaluation
4.1. Part 1: Discounted Return and States Discovered
4.1.1. Deep Sea
4.1.2. Multi-Passenger Taxi Driver
4.1.3. Deep Gridworld
4.1.4. Gridworlds
4.2. Part 2: In-Depth Investigation
4.2.1. Visitation Count at the End of the Learning
4.2.2. Impact of Visitation Value Discount Factor
4.2.3. Empirical Sample Complexity on the Deep Sea
4.2.4. Infinite Horizon Stochastic Chainworld
4.2.5. Stochastic Gridworlds
5. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Experiment Details
Appendix A.1. The Generic Q-Learning Scheme
Algorithm A1: Tabular Q-Learning with Replay Memory
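For reference, a generic sketch of tabular Q-learning with a replay memory is given below. It assumes a Gym-style `reset()`/`step()` environment interface and illustrative hyperparameters; the exact loop of Algorithm A1 may differ.

```python
import random
from collections import deque
import numpy as np

def tabular_q_learning_with_replay(env, n_states, n_actions, steps=10_000,
                                   alpha=0.5, gamma=0.99, eps=0.1,
                                   memory_size=100_000, batch_size=32):
    # Q-table and replay memory of (s, a, r, s', done) transitions.
    Q = np.zeros((n_states, n_actions))
    memory = deque(maxlen=memory_size)
    s = env.reset()
    for _ in range(steps):
        # Epsilon-greedy behavior policy; other exploration rules plug in here.
        if random.random() < eps:
            a = random.randrange(n_actions)
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done, _ = env.step(a)
        memory.append((s, a, r, s_next, done))
        # Replay a minibatch of stored transitions with the Q-learning update.
        batch = random.sample(memory, min(batch_size, len(memory)))
        for ms, ma, mr, ms_next, mdone in batch:
            target = mr if mdone else mr + gamma * Q[ms_next].max()
            Q[ms, ma] += alpha * (target - Q[ms, ma])
        s = env.reset() if done else s_next
    return Q
```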
Appendix A.2. Q-Learning with Augmented Reward
Algorithm A2: Tabular Q-Learning with Replay Memory and Augmented Reward
Appendix A.3. Q-Learning with Visitation Value
Algorithm A3: Tabular Q-Learning with Replay Memory and Visit. Value (UCB)
Algorithm A4: Tabular Q-Learning with Replay Memory and Visit. Value (Count)
Appendix A.4. Bootstrapped Q-Learning
Appendix A.5. Horizons
| | Deep Sea | Taxi | Deep Grid. | Grid. (Toy) | Grid. (Prison) | Grid. (Wall) |
|---|---|---|---|---|---|---|
| Optimal | Depth | 29 | 11 | 9 | 9 | 135 |
| Short H. | Depth | 33 | 55 | 11 | 11 | 330 |
| Long H. | Depth | 66 | 110 | 22 | 22 | 660 |
| Stochastic | - | - | 55 | 15 | 25 | - |
Appendix A.6. Infinite Memory vs. No Memory
Algorithm A5: Classic Tabular Q-Learning
Algorithm A6: Tabular Bootstrapped Q-Learning with Replay Memory
Appendix A.7. Recap Tables
The first recap table reports the discovery (%) and success (%) of each algorithm (Ours (UCB Reward), Ours (Count Reward), Rand. Prior (Osband 2019), Bootstr. (Osband 2016a), Bootstr. (D’Eramo 2019), UCB1 (Auer 2002), Expl. Bonus (Strehl 2008), ϵ-greedy, and Random) with zero and with optimistic Q-table initialization.
Five further recap tables report the same metrics for the same algorithms under each combination of zero/optimistic initialization and short/long horizon, and a final table recaps the results per domain (deep sea, taxi, deep gridworld, and the toy, prison, and wall gridworlds).
References
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari With Deep Reinforcement Learning. In Proceedings of the NIPS Workshop on Deep Learning, Lake Tahoe, CA, USA, 5 December 2013. [Google Scholar]
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
- Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv 2017, arXiv:1712.01815. [Google Scholar]
- Osband, I.; Roy, B.V.; Russo, D.J.; Wen, Z. Deep Exploration via Randomized Value Functions. J. Mach. Learn. Res. 2019, 20, 1–62. [Google Scholar]
- Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2 May 2016. [Google Scholar]
- Kakade, S. On the Sample Complexity of Reinforcement Learning. Ph.D. Thesis, University College London, London, UK, 2003. [Google Scholar]
- Szepesvari, C. Algorithms for Reinforcement Learning; Morgan & Claypool Publishers: Williston, VT, USA, 2010; Volume 4. [Google Scholar]
- Osband, I.; Van Roy, B.; Wen, Z. Generalization and Exploration via Randomized Value Functions. In Proceedings of the International Conference on Machine Learning (ICML), New York, NY, USA, 19 June 2016. [Google Scholar]
- Osband, I.; Blundell, C.; Pritzel, A.; Van Roy, B. Deep exploration via bootstrapped DQN. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Barcelona, Spain, 5 December 2016. [Google Scholar]
- Kearns, M.; Singh, S. Near-optimal reinforcement learning in polynomial time. Mach. Learn. 2002, 49, 209–232. [Google Scholar] [CrossRef] [Green Version]
- Brafman, R.I.; Tennenholtz, M. R-MAX—A general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res. 2002, 3, 213–231. [Google Scholar]
- Jaksch, T.; Ortner, R.; Auer, P. Near-optimal regret bounds for reinforcement learning. J. Mach. Learn. Res. 2010, 11, 1563–1600. [Google Scholar]
- Hester, T.; Stone, P. TEXPLORE: Real-time sample-efficient reinforcement learning for robots. Mach. Learn. 2013, 90, 385–429. [Google Scholar] [CrossRef] [Green Version]
- Strehl, A.L.; Li, L.; Wiewiora, E.; Langford, J.; Littman, M.L. PAC model-free reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), Pittsburgh, PA, USA, 25 June 2006; pp. 881–888. [Google Scholar]
- Bellemare, M.G.; Srinivasan, S.; Ostrovski, G.; Schaul, T.; Saxton, D.; Munos, R. Unifying count-based exploration and intrinsic motivation. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Barcelona, Spain, 5 December 2016; pp. 1471–1479. [Google Scholar]
- Jin, C.; Allen-Zhu, Z.; Bubeck, S.; Jordan, M.I. Is Q-learning provably efficient? In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 3 December 2018; pp. 4863–4873. [Google Scholar]
- Dong, K.; Wang, Y.; Chen, X.; Wang, L. Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP. In Proceedings of the International Conference on Learning Representation (ICLR), Virtual, 26 April 2020. [Google Scholar]
- Kolter, J.Z.; Ng, A.Y. Near-Bayesian exploration in polynomial time. In Proceedings of the International Conference on Machine Learning (ICML), Montreal, QC, Canada, 24 June 2009; pp. 513–520. [Google Scholar]
- Houthooft, R.; Chen, X.; Duan, Y.; Schulman, J.; De Turck, F.; Abbeel, P. VIME: Variational information maximizing exploration. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Barcelona, Spain, 5 December 2016; pp. 1109–1117. [Google Scholar]
- Pathak, D.; Agrawal, P.; Efros, A.A.; Darrell, T. Curiosity-driven Exploration by Self-supervised Prediction. In Proceedings of the International Conference on Machine Learning (ICML), Sydney, NSW, Australia, 6 August 2017. [Google Scholar]
- Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming; Wiley-Interscience: Chichester, UK, 1994; p. 694. [Google Scholar]
- Watkins, C.J.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
- Strens, M. A Bayesian framework for reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), Stanford, CA, USA, 29 June 2000; pp. 943–950. [Google Scholar]
- Poupart, P.; Vlassis, N.; Hoey, J.; Regan, K. An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), Pittsburgh, PA, USA, 25 June 2006; pp. 697–704. [Google Scholar]
- Lai, T.; Robbins, H. Asymptotically Efficient Adaptive Allocation Rules. Adv. Appl. Math. 1985, 6, 4–22. [Google Scholar] [CrossRef] [Green Version]
- Auer, P.; Ortner, R. Logarithmic online regret bounds for undiscounted reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 3 December 2007; pp. 49–56. [Google Scholar]
- Dann, C.; Brunskill, E. Sample complexity of episodic fixed-horizon reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 7 December 2015; pp. 2818–2826. [Google Scholar]
- Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 2002, 47, 235–256. [Google Scholar] [CrossRef]
- Ryan, R.M.; Deci, E.L. Intrinsic and extrinsic motivations: Classic definitions and new directions. Contemp. Educ. Psychol. 2000, 25, 54–67. [Google Scholar] [CrossRef]
- Strehl, A.L.; Littman, M.L. An analysis of model-based interval estimation for Markov decision processes. J. Comput. Syst. Sci. 2008, 74, 1309–1331. [Google Scholar] [CrossRef]
- Raileanu, R.; Rocktäschel, T. RIDE: Rewarding Impact-Driven Exploration for Procedurally-Generated Environments. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 26 April 2020. [Google Scholar]
- Stadie, B.C.; Levine, S.; Abbeel, P. Incentivizing exploration in reinforcement learning with deep predictive models. In Proceedings of the NIPS Workshop on Deep Reinforcement Learning, Montreal, QC, Canada, 7 December 2015. [Google Scholar]
- Schmidhuber, J. A Possibility for Implementing Curiosity and Boredom in Model-Building Neural Controllers. In Proceedings of the International Conference on Simulation of Adaptive Behavior (SAB), Paris, France, 25 August 1991; pp. 222–227. [Google Scholar]
- Schmidhuber, J. Developmental robotics, optimal artificial curiosity, creativity, music, and the fine arts. Connect. Sci. 2006, 18, 173–187. [Google Scholar] [CrossRef]
- Pathak, D.; Gandhi, D.; Gupta, A. Self-Supervised Exploration via Disagreement. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9 June 2019. [Google Scholar]
- Burda, Y.; Edwards, H.; Storkey, A.; Klimov, O. Exploration by random network distillation. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6 May 2019. [Google Scholar]
- Thompson, W.R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 1933, 25, 285–294. [Google Scholar] [CrossRef]
- Kaufmann, E.; Korda, N.; Munos, R. Thompson sampling: An asymptotically optimal finite-time analysis. In Proceedings of the International Conference on Algorithmic Learning Theory (ALT), Lyon, France, 29 October 2012. [Google Scholar]
- Agrawal, S.; Goyal, N. Further Optimal Regret Bounds for Thompson Sampling. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Scottsdale, AZ, USA, 29 April 2013. [Google Scholar]
- Scott, S.L. A modern Bayesian look at the multi-armed bandit. Appl. Stoch. Model. Bus. Ind. 2010, 26, 639–658. [Google Scholar] [CrossRef]
- Chapelle, O.; Li, L. An empirical evaluation of Thompson sampling. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Granada, Spain, 12 December 2011; pp. 2249–2257. [Google Scholar]
- Russo, D.; Van Roy, B. Learning to optimize via posterior sampling. Math. Oper. Res. 2014, 39, 1221–1243. [Google Scholar] [CrossRef] [Green Version]
- Russo, D.; Tse, D.; Van Roy, B. Time-sensitive bandit learning and satisficing thompson sampling. arXiv 2017, arXiv:1704.09028. [Google Scholar]
- Russo, D.J.; Van Roy, B.; Kazerouni, A.; Osband, I.; Wen, Z. A tutorial on Thompson sampling. Found. Trends Mach. Learn. 2018, 11, 1–96. [Google Scholar] [CrossRef]
- Osband, I.; Russo, D.; Van Roy, B. (More) efficient reinforcement learning via posterior sampling. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Lake Tahoe, CA, USA, 5 December 2013. [Google Scholar]
- D’Eramo, C.; Cini, A.; Restelli, M. Exploiting Action-Value Uncertainty to Drive Exploration in Reinforcement Learning. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14 July 2019. [Google Scholar]
- Fortunato, M.; Azar, M.G.; Piot, B.; Menick, J.; Osband, I.; Graves, A.; Mnih, V.; Munos, R.; Hassabis, D.; Pietquin, O.; et al. Noisy networks for exploration. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24 April 2017. [Google Scholar]
- Plappert, M.; Houthooft, R.; Dhariwal, P.; Sidor, S.; Chen, R.Y.; Chen, X.; Asfour, T.; Abbeel, P.; Andrychowicz, M. Parameter Space Noise for Exploration. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 3 April 2018. [Google Scholar]
- Gal, Y.; Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine learning (ICML), New York, NY, USA, 19 June 2016. [Google Scholar]
- Osband, I. Risk versus uncertainty in deep learning: Bayes, bootstrap and the dangers of dropout. In Proceedings of the NIPS Workshop on Bayesian Deep Learning, Barcelona, Spain, 5 December 2016. [Google Scholar]
- Efron, B. The Jackknife, the Bootstrap and Other Resampling Plans; SIAM: Philadelphia, PA, USA, 1982. [Google Scholar]
- Osband, I.; Aslanides, J.; Cassirer, A. Randomized prior functions for deep reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 3 December 2018; pp. 8617–8629. [Google Scholar]
- Van Hasselt, H. Double Q-learning. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 6 December 2010. [Google Scholar]
- D’Eramo, C.; Nuara, A.; Pirotta, M.; Restelli, M. Estimating the maximum expected value in continuous reinforcement learning problems. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, FL, USA, 20 April 2017. [Google Scholar]
- Tateo, D.; D’Eramo, C.; Nuara, A.; Restelli, M.; Bonarini, A. Exploiting structure and uncertainty of Bellman updates in Markov decision processes. In Proceedings of the Symposium Series on Computational Intelligence (SSCI), Honolulu, HI, USA, 27 November 2017. [Google Scholar]
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; The MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
- Tang, H.; Houthooft, R.; Foote, D.; Stooke, A.; Chen, O.X.; Duan, Y.; Schulman, J.; DeTurck, F.; Abbeel, P. #Exploration: A study of count-based exploration for deep reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 24 January 2017; pp. 2753–2762. [Google Scholar]
- Ostrovski, G.; Bellemare, M.G.; van den Oord, A.; Munos, R. Count-based exploration with neural density models. In Proceedings of the International Conference on Machine Learning (ICML), Sydney, NSW, Australia, 6 August 2017; pp. 2721–2730. [Google Scholar]
- Asadi, K.; Littman, M.L. An alternative softmax operator for reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), Sydney, NSW, Australia, 6 August 2017. [Google Scholar]
- Sutton, R.S. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning; Elsevier: Amsterdam, The Netherlands, 1990; pp. 216–224. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).