Abstract
Experience replay (ER) in (deep) reinforcement learning is generally considered applicable only to off-policy algorithms. However, ER has in some cases been applied to on-policy algorithms, suggesting that off-policyness might merely be a sufficient condition for applying ER rather than a necessary one. This paper therefore reconsiders stricter “experience replayable conditions” (ERC) and proposes a way of modifying existing algorithms to satisfy them. In light of this, it is postulated that the instability of policy improvements is a pivotal factor in the ERC. The instability factors are revealed from the viewpoint of metric learning as i) repulsive forces from negative samples and ii) replays of inappropriate experiences, and the corresponding stabilization tricks are derived accordingly. As a result, numerical simulations confirm that the proposed stabilization tricks make ER applicable to an advantage actor-critic, an on-policy algorithm. Moreover, its learning performance is comparable to that of a soft actor-critic, a state-of-the-art off-policy algorithm.
Data Availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
Notes
Theoretically, their learning rules correspond to expected SARSA [55], which is an off-policy algorithm, but their implementations rely on a rough Monte Carlo approximation and may lack rigor.
Continuous spaces are assumed without loss of generality to match the experiments in this paper.
The actual implementation has two Q functions to conservatively compute the target value.
Although the KL divergence does not strictly satisfy the definition of a distance, it is widely used in information geometry because of its distance-like properties: it is non-negative and equals zero only when the two probability distributions coincide.
The same is true if the second term is multiplied by the gain \(\lambda \in (0, 1)\).
References
Banerjee C, Chen Z, Noman N (2024) Improved soft actor-critic: mixing prioritized off-policy samples with on-policy experiences. IEEE Trans Neural Netw Learn Syst 35(3):3121–3129
Barron JT (2021) Squareplus: a softplus-like algebraic rectifier. arXiv preprint arXiv:2112.11687
Bejjani W, Papallas R, Leonetti M et al (2018a) Planning with a receding horizon for manipulation in clutter using a learned value function. arXiv:1803.08100
Bejjani W, Papallas R, Leonetti M et al (2018b) Planning with a receding horizon for manipulation in clutter using a learned value function. In: IEEE-RAS international conference on humanoid robots. IEEE, pp 1–9
Bellet A, Habrard A, Sebban M (2022) Metric learning. Springer Nature
Caggiano V, Wang H, Durandau G et al (2022) Myosuite–a contact-rich simulation suite for musculoskeletal motor control. In: Learning for dynamics and control conference. PMLR, pp 492–507
Chen J, Li SE, Tomizuka M (2021) Interpretable end-to-end urban autonomous driving with latent deep reinforcement learning. IEEE Trans Intell Transp Syst 23(6):5068–5078
Cheng D, Gong Y, Zhou S et al (2016) Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In: IEEE conference on computer vision and pattern recognition, pp 1335–1344
Christianos F, Schäfer L, Albrecht S (2020) Shared experience actor-critic for multi-agent reinforcement learning. Adv Neural Inf Process Syst 33:10707–10717
Cui Y, Osaki S, Matsubara T (2021) Autonomous boat driving system using sample-efficient model predictive control-based reinforcement learning approach. J Field Robot 38(3):331–354
Degris T, White M, Sutton RS (2012) Off-policy actor-critic. In: International conference on machine learning, pp 179–186
Fakoor R, Chaudhari P, Smola AJ (2020) P3O: policy-on policy-off policy optimization. In: Uncertainty in artificial intelligence. PMLR, pp 1017–1027
Fedus W, Ramachandran P, Agarwal R et al (2020) Revisiting fundamentals of experience replay. In: International conference on machine learning. PMLR, pp 3061–3071
Fujimoto S, Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: International conference on machine learning. PMLR, pp 1587–1596
Ganin Y, Ustinova E, Ajakan H et al (2016) Domain-adversarial training of neural networks. J Mach Learn Res 17(59):1–35
Gu SS, Lillicrap T, Turner RE et al (2017) Interpolated policy gradient: merging on-policy and off-policy gradient estimation for deep reinforcement learning. Adv Neural Inf Process Syst 30:3849–3858
Haarnoja T, Zhou A, Abbeel P et al (2018a) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning. PMLR, pp 1861–1870
Haarnoja T, Zhou A, Hartikainen K et al (2018b) Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905
Hambly B, Xu R, Yang H (2023) Recent advances in reinforcement learning in finance. Math Financ 33(3):437–503
Hansen S, Pritzel A, Sprechmann P et al (2018) Fast deep reinforcement learning using online adjustments from the past. Adv Neural Inf Process Syst 31:10590–10600
Ilboudo WEL, Kobayashi T, Matsubara T (2023) Adaterm: adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization. Neurocomputing 557:126692
Kalashnikov D, Irpan A, Pastor P et al (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on robot learning. PMLR, pp 651–673
Kapturowski S, Ostrovski G, Quan J et al (2019) Recurrent experience replay in distributed reinforcement learning. In: International conference on learning representations
Kobayashi T (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Appl Intell 49(12):4335–4347
Kobayashi T (2022a) L2c2: locally lipschitz continuous constraint towards stable and smooth reinforcement learning. In: IEEE/RSJ International conference on intelligent robots and systems. IEEE, pp 4032–4039
Kobayashi T (2022b) Optimistic reinforcement learning by forward kullback-leibler divergence optimization. Neural Netw 152:169–180
Kobayashi T (2023a) Intentionally-underestimated value function at terminal state for temporal-difference learning with mis-designed reward. arXiv preprint arXiv:2308.12772
Kobayashi T (2023b) Proximal policy optimization with adaptive threshold for symmetric relative density ratio. Results Control Optim 10:100192
Kobayashi T (2023c) Soft actor-critic algorithm with truly-satisfied inequality constraint. arXiv preprint arXiv:2303.04356
Kobayashi T (2024) Consolidated adaptive t-soft update for deep reinforcement learning. In: IEEE World congress on computational intelligence
Kobayashi T, Aotani T (2023) Design of restricted normalizing flow towards arbitrary stochastic policy with computational efficiency. Adv Robot 37(12):719–736
Levine S (2018) Reinforcement learning and control as probabilistic inference: tutorial and review. arXiv preprint arXiv:1805.00909
Lillicrap TP, Hunt JJ, Pritzel A et al (2016) Continuous control with deep reinforcement learning. In: International conference on learning representations
Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach Learn 8(3–4):293–321
Liu X, Zhu T, Jiang C et al (2022) Prioritized experience replay based on multi-armed bandit. Expert Syst Appl 189:116023
Mnih V, Kavukcuoglu K, Silver D et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533
Mnih V, Badia AP, Mirza M et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning. PMLR, pp 1928–1937
Novati G, Koumoutsakos P (2019) Remember and forget for experience replay. In: International conference on machine learning. PMLR, pp 4851–4860
Oh I, Rho S, Moon S et al (2021) Creating pro-level ai for a real-time fighting game using deep reinforcement learning. IEEE Trans on Games 14(2):212–220
Osband I, Aslanides J, Cassirer A (2018) Randomized prior functions for deep reinforcement learning. Adv Neural Inf Process Syst 31:8626–8638
Parmas P, Sugiyama M (2021) A unified view of likelihood ratio and reparameterization gradients. In: International conference on artificial intelligence and statistics. PMLR, pp 4078–4086
Paszke A, Gross S, Massa F et al (2019) Pytorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 32:8026–8037
Saglam B, Mutlu FB, Cicek DC et al (2023) Actor prioritized experience replay. J Artif Intell Res 78:639–672
Schaul T, Quan J, Antonoglou I et al (2016) Prioritized experience replay. In: International conference on learning representations
Schroff F, Kalenichenko D, Philbin J (2015) Facenet: a unified embedding for face recognition and clustering. In: IEEE conference on computer vision and pattern recognition, pp 815–823
Schulman J, Moritz P, Levine S et al (2016) High-dimensional continuous control using generalized advantage estimation. In: International conference on learning representations
Schulman J, Wolski F, Dhariwal P et al (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347
Sinha S, Song J, Garg A et al (2022) Experience replay with likelihood-free importance weights. In: Learning for dynamics and control conference. PMLR, pp 110–123
Srivastava N, Hinton G, Krizhevsky A et al (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
Stooke A, Achiam J, Abbeel P (2020) Responsive safety in reinforcement learning by pid lagrangian methods. In: International conference on machine learning. PMLR, pp 9133–9143
Sutton RS, Barto AG (2018) Reinforcement learning: an introduction. MIT press
Tai JJ, Wong J, Innocente M et al (2023) Pyflyt–uav simulation environments for reinforcement learning research. arXiv preprint arXiv:2304.01305
Todorov E, Erez T, Tassa Y (2012) Mujoco: a physics engine for model-based control. In: IEEE/RSJ international conference on intelligent robots and systems. IEEE, pp 5026–5033
Tunyasuvunakool S, Muldal A, Doron Y et al (2020) dm_control: software and tasks for continuous control. Softw Impacts 6:100022
Van Seijen H, Van Hasselt H, Whiteson S et al (2009) A theoretical and empirical analysis of expected sarsa. In: IEEE symposium on adaptive dynamic programming and reinforcement learning. IEEE, pp 177–184
Wang J, Song Y, Leung T et al (2014) Learning fine-grained image similarity with deep ranking. In: IEEE conference on computer vision and pattern recognition, pp 1386–1393
Wang X, Song J, Qi P et al (2021) Scc: an efficient deep reinforcement learning agent mastering the game of starcraft ii. In: International conference on machine learning. PMLR, pp 10905–10915
Wang Z, Bapst V, Heess N et al (2017) Sample efficient actor-critic with experience replay. In: International conference on learning representations
Wei W, Wang D, Li L et al (2024) Re-attentive experience replay in off-policy reinforcement learning. Mach Learn, pp 1–23
Wu P, Escontrela A, Hafner D et al (2023) Daydreamer: world models for physical robot learning. In: Conference on robot learning. PMLR, pp 2226–2240
Xuan H, Stylianou A, Liu X et al (2020) Hard negative examples are hard, but useful. In: European conference on computer vision, pp 126–142
Yu B, Liu T, Gong M et al (2018) Correcting the triplet selection bias for triplet loss. In: European conference on computer vision, pp 71–87
Zhang B, Sennrich R (2019) Root mean square layer normalization. Adv Neural Inf Process Syst 32:12381–12392
Zhang S, Boehmer W, Whiteson S (2019) Generalized off-policy actor-critic. Adv Neural Inf Process Syst 32:2001–2011
Zhao D, Wang H, Shao K et al (2016) Deep reinforcement learning with experience replay based on sarsa. In: IEEE symposium series on computational intelligence. IEEE, pp 1–6
Acknowledgements
This research was supported by a “Strategic Research Projects” grant from ROIS (Research Organization of Information and Systems).
Author information
Contributions
Taisuke Kobayashi was responsible for all aspects of this paper: Conceptualization, Methodology, Software, Validation, Investigation, Visualization, Funding acquisition, and Writing.
Ethics declarations
Competing Interests
The author declares that there are no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Compliance with Ethical Standards
The data used in this study were generated exclusively by the author. No research involving human participants or animals was performed.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Details of implementation
The algorithms used in this paper are implemented with PyTorch [42]. The implementation is based on the one in the literature [27]. The characteristic hyperparameters of this implementation are listed in Table 1.
The policy and value functions are approximated by fully-connected neural networks, each consisting of two hidden layers with 100 neurons. As activation functions, the Squish function [2, 31] and RMSNorm [63] are combined. AdaTerm [21], a noise-robust optimizer, is adopted to optimize the parameters robustly against the noise caused by bootstrapped learning in RL. Similarly, target networks updated by the CAT-soft update [30] are employed to stabilize learning (and to prevent degradation of the learning speed). In A2C, the target networks are applied to both the policy \(\pi \) and the value function V, while in SAC they are applied only to the action value function Q. In addition, A2C enhances output continuity by using L2C2 [25] with its default parameters, whereas SAC does not, in order to reproduce its standard implementation.
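As a rough, minimal sketch of this architecture (using Squareplus [2] as a stand-in for the exact Squish formulation, a hand-rolled RMSNorm [63], and illustrative hyperparameters; only the layer sizes follow the text), the function approximators could be built as follows:

```python
import torch
import torch.nn as nn

class Squareplus(nn.Module):
    """Softplus-like algebraic rectifier [2]; stand-in for the Squish activation."""
    def __init__(self, b: float = 4.0):
        super().__init__()
        self.b = b

    def forward(self, x):
        return 0.5 * (x + torch.sqrt(x * x + self.b))

class RMSNorm(nn.Module):
    """Root mean square layer normalization [63]."""
    def __init__(self, dim: int, eps: float = 1e-8):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.gain * x / rms

def make_mlp(in_dim: int, out_dim: int, hidden: int = 100) -> nn.Sequential:
    """Fully-connected network: two hidden layers with 100 neurons each."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), Squareplus(), RMSNorm(hidden),
        nn.Linear(hidden, hidden), Squareplus(), RMSNorm(hidden),
        nn.Linear(hidden, out_dim),
    )
```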
Both the A2C and SAC policies are modeled as Student’s t-distributions, which provide high global exploration capability [24]. The networks therefore output three model parameters: position, scale, and degrees of freedom. As in the standard implementation of SAC, the process of squashing the generated action into a bounded range is explicitly treated as part of the probability distribution. In A2C, on the other hand, this process is performed implicitly on the environment side and is not reflected in the probability distribution.
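A minimal sketch of such a policy head is given below, reusing make_mlp from the previous sketch; the softplus-based transforms that keep the scale positive and the degrees of freedom bounded away from two are illustrative assumptions, not the paper’s exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import StudentT

class StudentTPolicy(nn.Module):
    """Policy outputting position, scale, and degrees of freedom of a Student's t-distribution [24]."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.body = make_mlp(obs_dim, 3 * act_dim)

    def forward(self, obs: torch.Tensor) -> StudentT:
        loc, s, d = self.body(obs).chunk(3, dim=-1)
        scale = F.softplus(s) + 1e-6   # scale must be positive
        df = F.softplus(d) + 2.0       # keep the variance finite (assumption)
        return StudentT(df, loc, scale)
```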
SAC approximates two action value functions \(Q_{1,2}\) with independent networks, as in its standard implementation, and aims at conservative learning by selecting the smaller of the two values. A2C, on the other hand, aims at stable learning by outputting 10 values from a shared network and using their median as the representative value. To enhance the effect of ensemble learning, each output is computed with both learnable and unlearnable parameters, so that the outputs can easily take on different values, especially in unlearned regions [40].
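One way to realize such an ensemble, loosely following the randomized prior functions of [40] and again reusing make_mlp, is sketched below; the shared-trunk sizing and the additive combination of frozen prior heads with trainable heads are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EnsembleValue(nn.Module):
    """10 value heads on a shared trunk; each head adds a frozen random prior [40]."""
    def __init__(self, obs_dim: int, n_heads: int = 10, feat: int = 100):
        super().__init__()
        self.trunk = make_mlp(obs_dim, feat)                # shared, learnable features
        self.heads = nn.ModuleList(nn.Linear(feat, 1) for _ in range(n_heads))
        self.priors = nn.ModuleList(nn.Linear(feat, 1) for _ in range(n_heads))
        for p in self.priors.parameters():
            p.requires_grad_(False)                         # unlearnable parameters

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        h = self.trunk(obs)
        vs = torch.cat([head(h) + prior(h)                  # learnable + frozen outputs
                        for head, prior in zip(self.heads, self.priors)], dim=-1)
        return vs.median(dim=-1, keepdim=True).values       # representative value
```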
A2C and SAC share the ER settings, with a buffer size of 102,400 and a batch size of 256. The replay buffer is a FIFO that deletes the oldest empirical data when the buffer size is exceeded. At the end of each episode, half of the empirical data stored in the buffer is replayed uniformly at random.
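A minimal sketch of this buffer, assuming single-transition tuples and leaving the update loop itself out, might look as follows:

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO replay buffer: the oldest transition is discarded once capacity is reached."""
    def __init__(self, capacity: int = 102_400):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size: int = 256):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

# At the end of each episode, draw batches until roughly half of the stored
# data has been replayed uniformly at random (an illustrative reading of the text):
# for _ in range(len(buffer.buffer) // 2 // 256):
#     batch = buffer.sample(256)
```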
Appendix B: Application to PPO
The proposed stabilization tricks are applied to PPO [47], a recent on-policy algorithm other than A2C. PPO multiplies the policy improvement by the policy likelihood ratio obtained via importance sampling, and this ratio is clipped so that the gradient is forced to zero when the ratio becomes excessive. The recommended clipping threshold of 0.2 is employed. Since this paper uses an ER that accumulates empirical data as the simplest single transitions, GAE [46], which is often used in conjunction with PPO, is omitted. In addition, to isolate the regularization effects of the stabilization tricks, the extra regularization term, i.e. the policy entropy, is also omitted. As mentioned in the Introduction, since PPO is empirically ER-applicable, the learning trends and the changes in the final policies due to each stabilization trick can be compared quantitatively.
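For reference, a minimal sketch of PPO’s clipped surrogate objective with the recommended threshold of 0.2 (variable names are illustrative) is:

```python
import torch

def ppo_clip_loss(log_pi: torch.Tensor, log_b: torch.Tensor,
                  advantage: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective of PPO [47]."""
    ratio = torch.exp(log_pi - log_b)                         # importance ratio pi / b
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the minimum forces the gradient to zero once the ratio is excessive.
    return -torch.min(unclipped, clipped).mean()
```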
With the above setup, the results of solving QuadX-Waypoints-v2 and Fixedwing-Waypoints-v2 provided by PyFlyt [52] are summarized below. Note that these tasks control different drones: a quadrotor and a fixed-wing drone, respectively. The learning curves over 7 trials and the test results after learning under the four conditions, i.e. with and without each of the two stabilization tricks, are depicted in Fig. 9.
First, the vanilla PPO shows remarkable but expected results. Although PPO is originally regarded as an on-policy algorithm, as described in the main text, it learned both tasks in combination with ER even without the proposed stabilization tricks. This is because, as discussed in Section 6.1, PPO performs regularization and clipping heuristically so that \(\pi \simeq b\) (i.e. on-policyness) holds. In fact, PPO with only one of the stabilization tricks did not saturate the corresponding internal parameter, indicating that the ERC was satisfied without losing its functionality. Note, however, that PPO by itself may not be sufficient to satisfy the ERC, since GAE, which was omitted from the implementation this time, relies more heavily on the behavior policy than the simple advantage function (i.e. the TD error) does.
Next, the contributions of the proposed stabilization tricks are confirmed. It is easy to see that the mining trick increases the learning speed, while the counteraction trick tends to decrease it. This may be because the mining trick replays only the empirical data useful for learning, making the direction of policy improvements clearer, whereas the counteraction trick restricts policy improvements by binding \(\pi \simeq b\). Indeed, for the former, the scale of the TD error, which indicates the learning direction, increased when the mining trick was added.
On the other hand, the counteraction trick seems to improve the exploration capability in exchange for the learning speed. The return at the end of learning was maximized by including the counteraction trick on Fixedwing, although the return on Quadrotor with it was lower than the others due to slow convergence. This may be due to the increase in the entropy of \(\pi \) caused by regularizing it toward the various behavior policies in the replay buffer. In fact, adding the counteraction trick decreased \(\ln \pi \), which corresponds to the negative entropy, during learning.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.