Revisiting experience replayable conditions

Abstract

Experience replay (ER) used in (deep) reinforcement learning is considered to be applicable only to off-policy algorithms. However, there have been some cases in which ER has been applied to on-policy algorithms, suggesting that off-policyness might be a sufficient condition for applying ER. This paper reconsiders stricter “experience replayable conditions” (ERC) and proposes a way of modifying existing algorithms to satisfy ERC. In light of this, it is postulated that the instability of policy improvements represents a pivotal factor in ERC. The instability factors are revealed from the viewpoint of metric learning as i) repulsive forces from negative samples and ii) replays of inappropriate experiences. Accordingly, the corresponding stabilization tricks are derived. As a result, it is confirmed through numerical simulations that the proposed stabilization tricks make ER applicable to an advantage actor-critic, an on-policy algorithm. Moreover, its learning performance is comparable to that of a soft actor-critic, a state-of-the-art off-policy algorithm.

Data Availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Notes

  1. Bejjani et al. [3] switched from SARSA to deep Q-networks (DQN) [36], an off-policy algorithm, in their subsequent work [4].

  2. Theoretically, their learning rules correspond to expected SARSA [55], which is an off-policy algorithm, but their implementations rely on a rough Monte Carlo approximation and would therefore lack rigor.

  3. Continuous spaces are assumed without loss of generality to match the experiments in this paper.

  4. The actual implementation has two Q functions to conservatively compute the target value.

  5. Although the KL divergence does not actually satisfy the definition of a distance, it is widely used in probability geometry because of its distance-like properties: it is non-negative and equals zero only when the two probability distributions coincide.

  6. The same is true if the second term is multiplied by the gain \(\lambda \in (0, 1)\).

References

  1. Banerjee C, Chen Z, Noman N (2024) Improved soft actor-critic: mixing prioritized off-policy samples with on-policy experiences. IEEE Trans Neural Netw Learn Syst 35(3):3121–3129

  2. Barron JT (2021) Squareplus: a softplus-like algebraic rectifier. arXiv preprint arXiv:2112.11687

  3. Bejjani W, Papallas R, Leonetti M et al (2018a) Planning with a receding horizon for manipulation in clutter using a learned value function. arXiv:1803.08100

  4. Bejjani W, Papallas R, Leonetti M et al (2018b) Planning with a receding horizon for manipulation in clutter using a learned value function. In: IEEE-RAS international conference on humanoid robots. IEEE, pp 1–9

  5. Bellet A, Habrard A, Sebban M (2022) Metric learning. Springer Nature

  6. Caggiano V, Wang H, Durandau G et al (2022) MyoSuite: a contact-rich simulation suite for musculoskeletal motor control. In: Learning for dynamics and control conference. PMLR, pp 492–507

  7. Chen J, Li SE, Tomizuka M (2021) Interpretable end-to-end urban autonomous driving with latent deep reinforcement learning. IEEE Trans Intell Transp Syst 23(6):5068–5078

  8. Cheng D, Gong Y, Zhou S et al (2016) Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In: IEEE conference on computer vision and pattern recognition, pp 1335–1344

  9. Christianos F, Schäfer L, Albrecht S (2020) Shared experience actor-critic for multi-agent reinforcement learning. Adv Neural Inf Process Syst 33:10707–10717

  10. Cui Y, Osaki S, Matsubara T (2021) Autonomous boat driving system using sample-efficient model predictive control-based reinforcement learning approach. J Field Robot 38(3):331–354

  11. Degris T, White M, Sutton RS (2012) Off-policy actor-critic. In: International conference on machine learning, pp 179–186

  12. Fakoor R, Chaudhari P, Smola AJ (2020) P3O: policy-on policy-off policy optimization. In: Uncertainty in artificial intelligence. PMLR, pp 1017–1027

  13. Fedus W, Ramachandran P, Agarwal R et al (2020) Revisiting fundamentals of experience replay. In: International conference on machine learning. PMLR, pp 3061–3071

  14. Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International conference on machine learning. PMLR, pp 1587–1596

  15. Ganin Y, Ustinova E, Ajakan H et al (2016) Domain-adversarial training of neural networks. J Mach Learn Res 17(59):1–35

  16. Gu SS, Lillicrap T, Turner RE et al (2017) Interpolated policy gradient: merging on-policy and off-policy gradient estimation for deep reinforcement learning. Adv Neural Inf Process Syst 30:3849–3858

  17. Haarnoja T, Zhou A, Abbeel P et al (2018a) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning. PMLR, pp 1861–1870

  18. Haarnoja T, Zhou A, Hartikainen K et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905

  19. Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Math Financ 33(3):437–503

  20. Hansen S, Pritzel A, Sprechmann P et al (2018) Fast deep reinforcement learning using online adjustments from the past. Adv Neural Inf Process Syst 31:10590–10600

  21. Ilboudo WEL, Kobayashi T, Matsubara T (2023) AdaTerm: adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692

  22. Kalashnikov D, Irpan A, Pastor P et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on robot learning. PMLR, pp 651–673

  23. Kapturowski S, Ostrovski G, Quan J et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International conference on learning representations

  24. Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Appl Intell 49(12):4335–4347

  25. Kobayashi T (2022a) L2C2: locally Lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ international conference on intelligent robots and systems. IEEE, pp 4032–4039

  26. Kobayashi T (2022b) Optimistic reinforcement learning by forward Kullback-Leibler divergence optimization. Neural Netw 152:169–180

  27. Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:2308.12772

  28. Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results Control Optim 10:100192

  29. Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:2303.04356

  30. Kobayashi T (2024) Consolidated adaptive t-soft update for deep reinforcement learning. In: IEEE World congress on computational intelligence

  31. Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Adv Robot 37(12):719–736

  32. Levine S (2018) Reinforcement learning and control as probabilistic inference: tutorial and review. arXiv preprint arXiv:1805.00909

  33. Lillicrap TP, Hunt JJ, Pritzel A et al (2016) Continuous control with deep reinforcement learning. In: International conference on learning representations

  34. Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach Learn 8(3–4):293–321

  35. Liu X, Zhu T, Jiang C et al (2022) Prioritized experience replay based on multi-armed bandit. Expert Syst Appl 189:116023

  36. Mnih V, Kavukcuoglu K, Silver D et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533

  37. Mnih V, Badia AP, Mirza M et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning. PMLR, pp 1928–1937

  38. Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International conference on machine learning. PMLR, pp 4851–4860

  39. Oh I, Rho S, Moon S et al (2021) Creating pro-level AI for a real-time fighting game using deep reinforcement learning. IEEE Trans Games 14(2):212–220

  40. Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Adv Neural Inf Process Syst 31:8626–8638

  41. Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International conference on artificial intelligence and statistics. PMLR, pp 4078–4086

  42. Paszke A, Gross S, Massa F et al (2019) PyTorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 32:8026–8037

  43. Saglam B, Mutlu FB, Cicek DC et al (2023) Actor prioritized experience replay. J Artif Intell Res 78:639–672

  44. Schaul T, Quan J, Antonoglou I et al (2016) Prioritized experience replay. In: International conference on learning representations

  45. Schroff F, Kalenichenko D, Philbin J (2015) FaceNet: a unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823

  46. Schulman J, Moritz P, Levine S et al (2016) High-dimensional continuous control using generalized advantage estimation. In: International conference on learning representations

  47. Schulman J, Wolski F, Dhariwal P et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347

  48. Sinha S, Song J, Garg A et al (2022) Experience replay with likelihood-free importance weights. In: Learning for dynamics and control conference. PMLR, pp 110–123

  49. Srivastava N, Hinton G, Krizhevsky A et al (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958

  50. Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by PID Lagrangian methods. In: International conference on machine learning. PMLR, pp 9133–9143

  51. Sutton RS, Barto AG (2018) Reinforcement learning: an introduction. MIT press

  52. Tai JJ, Wong J, Innocente M et al (2023) PyFlyt: UAV simulation environments for reinforcement learning research. arXiv preprint arXiv:2304.01305

  53. Todorov E, Erez T, Tassa Y (2012) MuJoCo: a physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems. IEEE, pp 5026–5033

  54. Tunyasuvunakool S, Muldal A, Doron Y et al (2020) dm_control: software and tasks for continuous control. Softw Impacts 6:100022

  55. Van Seijen H, Van Hasselt H, Whiteson S et al (2009) A theoretical and empirical analysis of expected SARSA. In: IEEE symposium on adaptive dynamic programming and reinforcement learning. IEEE, pp 177–184

  56. Wang J, Song Y, Leung T et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393

  57. Wang X, Song J, Qi P et al (2021) SCC: an efficient deep reinforcement learning agent mastering the game of StarCraft II. In: International conference on machine learning. PMLR, pp 10905–10915

  58. Wang Z, Bapst V, Heess N et al (2017) Sample efficient actor-critic with experience replay. In: International conference on learning representations

  59. Wei W, Wang D, Li L et al (2024) Re-attentive experience replay in off-policy reinforcement learning. Mach Learn, pp 1–23

  60. Wu P, Escontrela A, Hafner D et al (2023) Daydreamer: world models for physical robot learning. In: Conference on robot learning. PMLR, pp 2226–2240

  61. Xuan H, Stylianou A, Liu X et al (2020) Hard negative examples are hard, but useful. In: European conference on computer vision, pp 126–142

  62. Yu B, Liu T, Gong M et al (2018) Correcting the triplet selection bias for triplet loss. In: European conference on computer vision, pp 71–87

  63. Zhang B, Sennrich R (2019) Root mean square layer normalization. Adv Neural Inf Process Syst 32:12381–12392

  64. Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Adv Neural Inf Process Syst 32:2001–2011

  65. Zhao D, Wang H, Shao K et al (2016) Deep reinforcement learning with experience replay based on SARSA. In: IEEE symposium series on computational intelligence. IEEE, pp 1–6

Acknowledgements

This research was supported by “Strategic Research Projects” grant from ROIS (Research Organization of Information and Systems).

Author information

Contributions

Taisuke Kobayashi contributed to everything for this paper: Conceptualization, Methodology, Software, Validation, Investigation, Visualization, Funding acquisition, and Writing.

Corresponding author

Correspondence to Taisuke Kobayashi.

Ethics declarations

Competing Interests

The author declares that there are no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Compliance with Ethical Standards

The data used in this study was exclusively generated by the author. No research involving human participants or animals has been performed.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Details of implementation

The algorithms used in this paper are implemented with PyTorch [42]. The implementation is based on that in the literature [27]. The characteristic hyperparameters of this implementation are listed in Table 1.

Table 1 Parameter configuration

The policy and value functions are approximated by fully-connected neural networks consisting of two hidden layers with 100 neurons each. For the activation functions, the Squish function [2, 31] and RMSNorm [63] are combined. AdaTerm [21], a noise-robust optimizer, is adopted to optimize the parameters robustly against the noise caused by bootstrapped learning in RL. Similarly, target networks updated by the CAT-soft update [30] are employed to stabilize learning (and to prevent degradation of the learning speed). In A2C, the target networks are applied to both the policy \(\pi \) and the value function V, while in SAC they are applied only to the action value function Q. In addition, A2C enhances the continuity of its outputs by using L2C2 [25] with default parameters, whereas SAC does not, in order to reproduce its standard implementation.
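
As a rough illustration of this backbone, the following is a minimal PyTorch sketch. The exact definition of the Squish activation [2, 31] is not reproduced here; a squareplus rectifier [2] is used as a stand-in, and the per-layer ordering of normalization and activation is an assumption rather than the paper's exact construction.

```python
import torch
import torch.nn as nn


class Squareplus(nn.Module):
    # Softplus-like algebraic rectifier [2]: f(x) = (x + sqrt(x^2 + b)) / 2
    def __init__(self, b: float = 4.0):
        super().__init__()
        self.b = b

    def forward(self, x):
        return 0.5 * (x + torch.sqrt(x * x + self.b))


class RMSNorm(nn.Module):
    # Root mean square layer normalization [63]
    def __init__(self, dim: int, eps: float = 1e-8):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms


class Backbone(nn.Module):
    # Two hidden layers with 100 neurons each; each layer combines
    # normalization with the (stand-in) activation.
    def __init__(self, in_dim: int, hidden: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), RMSNorm(hidden), Squareplus(),
            nn.Linear(hidden, hidden), RMSNorm(hidden), Squareplus(),
        )

    def forward(self, x):
        return self.net(x)
```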

Both the A2C and SAC policies are modeled as Student's t-distributions, which offer high global exploration capability [24]. Accordingly, the networks output the three model parameters: position, scale, and degrees of freedom. As in the standard implementation of SAC, the process of converting the generated action to a bounded range is also explicitly treated as part of the probability distribution. In A2C, by contrast, this process is performed implicitly on the environment side and is not reflected in the probability distribution.
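
A hedged sketch of such a policy head is shown below. The softplus parameterizations that keep the scale positive and the degrees of freedom above two are assumptions for illustration, not the paper's exact construction, and the explicit squashing used by SAC is omitted.

```python
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import StudentT


class StudentTPolicy(nn.Module):
    # Policy head producing the three parameters of a Student's t-distribution [24]:
    # position (loc), scale, and degrees of freedom (df).
    def __init__(self, feature_dim: int, action_dim: int):
        super().__init__()
        self.loc = nn.Linear(feature_dim, action_dim)
        self.scale = nn.Linear(feature_dim, action_dim)
        self.df = nn.Linear(feature_dim, action_dim)

    def forward(self, h):
        loc = self.loc(h)
        scale = F.softplus(self.scale(h)) + 1e-6   # scale > 0
        df = F.softplus(self.df(h)) + 2.0          # df > 2 keeps the variance finite (assumption)
        return StudentT(df=df, loc=loc, scale=scale)


# Usage (A2C case): the sampled action is unbounded here; bounding to the valid
# action range is assumed to be handled on the environment side, as described above.
# dist = StudentTPolicy(100, act_dim)(features)
# action = dist.sample()
# log_prob = dist.log_prob(action).sum(-1)
```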

SAC approximates the two action value functions \(Q_{1,2}\) with independent networks, as in its standard implementation, and aims at conservative learning by selecting the smaller value. A2C, on the other hand, aims at stable learning by outputting 10 values from a shared network and using their median as the representative value. To enhance the effect of ensemble learning, the outputs are computed with both learnable and fixed (non-learnable) parameters, so that the individual outputs can easily take on different values, especially in unlearned regions [40].
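
The sketch below illustrates one plausible form of this ensemble value head, following the randomized prior functions of [40]. How exactly the learnable and fixed parameters are combined in the paper is not specified here; the additive frozen prior and the hypothetical prior_scale factor are assumptions.

```python
import torch
import torch.nn as nn


class EnsembleValueHead(nn.Module):
    # 10 value outputs from a shared feature; each output is the sum of a learnable
    # head and a frozen, randomly initialized prior head [40], so that the ensemble
    # members disagree more strongly in unlearned regions.
    def __init__(self, feature_dim: int, n_heads: int = 10, prior_scale: float = 1.0):
        super().__init__()
        self.learnable = nn.Linear(feature_dim, n_heads)
        self.prior = nn.Linear(feature_dim, n_heads)
        for p in self.prior.parameters():
            p.requires_grad_(False)   # non-learnable (fixed) parameters
        self.prior_scale = prior_scale

    def forward(self, h):
        return self.learnable(h) + self.prior_scale * self.prior(h)  # shape: (batch, 10)


# The representative value is the median over the ensemble outputs:
# v = EnsembleValueHead(100)(features).median(dim=-1).values
```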

A2C and SAC share the ER settings, with a buffer size of 102,400 and a batch size of 256. The replay buffer follows a FIFO format, deleting the oldest empirical data once the buffer size is exceeded. At the end of each episode, half of the empirical data stored in the ER buffer is replayed uniformly at random.
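
A minimal sketch of this buffer is given below. The way "half of the stored data" is replayed (repeated uniform mini-batches until roughly half the buffer has been drawn) is an assumed schedule, and update_fn is a hypothetical callback standing in for the actual learning update.

```python
import random
from collections import deque


class ReplayBuffer:
    # FIFO buffer: the oldest transition is dropped once the capacity is exceeded.
    def __init__(self, capacity: int = 102_400):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 256):
        # Uniform random sampling of a mini-batch.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)


def replay_at_episode_end(buffer: ReplayBuffer, update_fn, batch_size: int = 256):
    # Replay roughly half of the stored data, uniformly at random,
    # in mini-batches of 256 at the end of each episode (assumed schedule).
    n_updates = len(buffer) // 2 // batch_size
    for _ in range(n_updates):
        update_fn(buffer.sample(batch_size))
```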

Fig. 9 Returns with PPO on PyFlyt: From (a) and (b), it can be seen that the mining trick improves learning speed, while the counteraction trick facilitates exploration, which yielded the maximum and stable return on Fixedwing in (c)

Appendix B: Application to PPO

The proposed stabilization tricks are applied to PPO [47], a recent on-policy algorithm other than A2C. PPO multiplies the policy improvement by the policy likelihood ratio obtained via importance sampling, and this ratio is clipped so that the gradient is forced to zero when it becomes excessive. The recommended clipping threshold, 0.2, is employed. Since this paper uses an ER that accumulates empirical data as the simplest single transitions, GAE [46], which is often used in conjunction with PPO, is omitted. In addition, to isolate the regularization effects of the stabilization tricks, the extra regularization term, i.e. the policy entropy, is also omitted. As mentioned in the Introduction, PPO is empirically ER-applicable, so it is possible to quantitatively compare the learning trends and the changes in the final policies due to each stabilization trick.
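
For reference, a minimal sketch of the clipped surrogate loss with the recommended threshold of 0.2 is shown below. The proposed mining and counteraction tricks themselves are not included; only the standard PPO clipping and the one-step TD-error advantage that replaces GAE (as an assumption about the simplified setup) are illustrated.

```python
import torch


def ppo_policy_loss(log_pi, log_b, advantage, clip_eps: float = 0.2):
    # Clipped surrogate objective of PPO [47]: the importance ratio pi/b multiplies
    # the advantage and is clipped at 1 +/- clip_eps, so the gradient vanishes
    # when the ratio becomes excessive.
    ratio = torch.exp(log_pi - log_b)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()


# Since GAE is omitted, the advantage is assumed to be the simple one-step TD error
# computed from the replayed single transition:
#   A = r + gamma * (1 - done) * V(s') - V(s)
```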

With the above setup, the results of solving QuadX-Waypoints-v2 and Fixedwing-Waypoints-v2, provided by PyFlyt [52], are summarized below. Note that these tasks involve controlling different drones (a quadrotor and a fixed-wing aircraft, respectively). The learning curves over 7 trials and the test results after learning, under the four conditions obtained with and without each of the two stabilization tricks, are depicted in Fig. 9.

First, vanilla PPO shows remarkable but expected results. Although PPO is originally considered an on-policy algorithm, as described in the text, it achieved learning of both tasks in combination with ER even without the proposed stabilization tricks. This is because, as discussed in Section 6.1, PPO heuristically performs regularization and clipping such that \(\pi \simeq b\) (i.e. on-policyness) holds. In fact, PPO with only one of the stabilization tricks did not saturate the corresponding internal parameter, indicating that ERC was satisfied without losing its functionality. Note, however, that PPO by itself may not be sufficient to satisfy ERC, since GAE, which was omitted from this implementation, relies more heavily on the behavior policy than the simple advantage function (i.e. the TD error) does.

Next, the contributions of the proposed stabilization tricks are confirmed. It is easy to see that the mining trick increases learning speed, while the counteraction trick tends to decrease it. This may be because the direction of policy improvements becomes clearer when the mining trick replays only the empirical data that are useful for learning, while the counteraction trick restricts policy improvements by binding \(\pi \simeq b\). For the former, the scale of the TD error, which indicates the learning direction, indeed increased when the mining trick was added.

On the other hand, the counteraction trick seems to improve the exploration capability in exchange for learning speed. The return at the end of learning was maximized by including the counteraction trick on Fixedwing, although the return on Quadrotor with it was lower than the others due to slow convergence. This may be due to the increase in the entropy of \(\pi \) caused by regularizing it toward the various behavior policies in the replay buffer. In fact, the addition of the counteraction trick yielded a decrease in \(\ln \pi \), which corresponds to the negative entropy, during learning.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kobayashi, T. Revisiting experience replayable conditions. Appl Intell 54, 9381–9394 (2024). https://doi.org/10.1007/s10489-024-05685-7
