
1 Introduction

Reinforcement learning (RL) methods learn an optimal policy by maximizing the expected return. These methods work well in environments with dense rewards but suffer from degenerate performance when rewards are sparse, i.e., nearly zero throughout an episode. In such cases, the agent requires both an efficient exploration method that guides it toward potentially useful experiences, and an efficient exploitation technique that makes better use of those experiences.

An exploration bonus is a simple form of directed exploration: it generates an intrinsic reward at every step even when external rewards are unavailable. The bonus is designed to be higher in novel states than in frequently visited ones, and thus encourages the agent to explore new behavior. However, such methods still require a tremendous amount of time to train, and the intrinsic rewards may vanish when the policy converges to a local optimum.

Self-imitation learning is a recently introduced algorithm that enhances exploitation by storing and reusing useful experiences. Whenever the agent happens to generate high-rewarding trajectories, they are collected and stored in a replay buffer. The policy is then trained to imitate the trajectories in the buffer, so the resulting agent consistently reproduces past good behavior. This method is shown to have high sample efficiency, but it hurts exploration and may get stuck at local optima. Moreover, self-imitation learning relies on other techniques to bootstrap the process before the first trajectory is added to the buffer.

In this paper, we propose Explore-then-Exploit (EE), a framework that combines random network distillation (RND) [5], a form of exploration bonus, with generative adversarial self-imitation learning (GASIL) [7]. By integrating the two methods, the RND bonus solves the initial bootstrap problem of self-imitation and potentially prevents the policy from getting stuck at local optima, while GASIL speeds up the convergence of the policy and provides good starting points for later exploration. Nevertheless, a direct composition is problematic: with exploration bonuses the agent prefers unpredictable actions, whereas with self-imitation it tends to reproduce past behavior, and mixing these two objectives confuses the agent and leads to undesired results. Instead, we suggest combining them in an interleaving manner, so that the agent learns only one concept at each stage and switches to the other when certain criteria are reached. We formulate our framework as a reward interpolation and provide heuristic methods to determine the weight between exploration and self-imitation. Finally, we evaluate our model on several MuJoCo tasks [19] with episodic rewards and empirically show that EE improves over GASIL or RND in most environments.

2 Related Work

Exploration. There has been much research on exploration in RL. Count-based exploration gives a reward to rarely visited states [2, 13]. Prediction-based exploration, also known as the curiosity-driven method, predicts the agent’s dynamics and treats the prediction error as an intrinsic reward [1, 4, 14, 18]. Random network distillation (RND) [5] improves upon these approaches by utilizing two neural networks: a randomly initialized target network, and a predictor network trained to minimize the difference between the outputs of the two networks. This difference is then used as the exploration bonus.

Self-imitation. Self-imitation learning (SIL) [12] was recently proposed to exploit past good behaviors. The algorithm stores transitions in a replay memory and uses them to update the policy when the stored return is higher than the current state-value estimate. Generative adversarial self-imitation learning (GASIL) [7] is a generative adversarial extension of SIL, which instead stores the top-k experiences ranked by episode return and formulates self-imitation as a divergence minimization problem. Combined with actor-critic methods, GASIL learns a shaped, dense reward function that can be used as an extra reward signal. Another method [6] also utilizes the generative adversarial structure, but trains an ensemble of agents and uses a special optimization technique to guarantee diversity among the agents. That work focuses on the interaction between multiple agents, whereas ours considers a single agent; the technique can be used alongside our framework without conflict.

Amplifying the Imitation Effect. In very recent work, AIE [10] was proposed to combine RND and SIL, which is similar in spirit to our method. That work combines the two algorithms directly and introduces several techniques to enhance the effect of RND, whereas ours integrates GASIL in an interleaving fashion and is evaluated on common RL benchmarks.

3 Background

3.1 Reinforcement Learning

Consider a state space \(\mathcal {S}\) and an action space \(\mathcal {A}\). The purpose of RL is to find a parameterized policy \(\pi _\theta (a|s)\), where \(s \in \mathcal {S}\) and \(a \in \mathcal {A}\), that maximizes the expected discounted sum of rewards \(\eta (\pi ) = \mathbb {E}_{\pi } [\sum _{t = 0}^{\infty } \gamma ^t r_t]\), where \(\gamma \in (0, 1]\) is the discount factor and \(r_t\) is the reward at time t. The objective is non-differentiable and hence requires policy gradient techniques to estimate the gradient. A commonly used gradient estimator of the objective \(\eta (\pi _\theta )\) is given by

$$\begin{aligned} \nabla _\theta \eta (\pi _\theta ) = \mathbb {E}_{\pi } [\nabla _\theta \log {\pi _\theta (a_t | s_t)} \hat{A}_t], \end{aligned}$$
(1)

where \(\hat{A}_t\) is the advantage estimation at time t. The estimator \(\nabla _\theta \eta (\pi _\theta )\) can be obtained by differentiating the surrogate objective

$$\begin{aligned} \mathcal {L}^{\mathrm {PG}}(\theta ) = \mathbb {E}_{\pi } [\log {\pi _\theta (a_t | s_t)} \hat{A}_t]. \end{aligned}$$
(2)
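
As a concrete illustration (not part of the original paper), Eq. (2) corresponds to a one-line loss in an automatic-differentiation framework. The sketch below assumes `log_probs` and `advantages` are tensors gathered from rollouts; minimizing the returned loss performs gradient ascent on the objective.

```python
import torch

def pg_surrogate_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Negative of the surrogate objective in Eq. (2).

    log_probs:  log pi_theta(a_t | s_t) for each sampled transition, shape (T,)
    advantages: advantage estimates A_hat_t, shape (T,)
    """
    # Advantages are treated as constants, so gradients flow only through log pi_theta.
    return -(log_probs * advantages.detach()).mean()
```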

3.2 Generative Adversarial Self-imitation Learning

GASIL involves a good trajectory buffer \(\mathcal {B} = \{\tau _i\}\) and a discriminator \(D_\phi (s, a)\). The buffer \(\mathcal {B}\) stores the top-k good trajectories according to the total trajectory reward \(R = \sum _{t = 0}^\infty r_t\), where each trajectory \(\tau _i\) consists of a sequence of states and actions \(\{s_0, a_0, s_1, a_1, \dots s_t, a_t\}\). The algorithm treats the trajectories stored in the buffer \(\mathcal {B}\) as the expert demonstrations, and utilizes the GAIL framework [8] to obtain a similar generative adversarial loss

$$\begin{aligned} \begin{aligned} \mathop {\text {argmin}}\limits _\theta \,\mathop {\text {argmax}}\limits _\phi \,\mathcal {L}^{\mathrm {GASIL}} (\theta , \phi ) = {}&\mathbb {E}_{\tau _\pi } [\log {D_\phi (s, a)}] \\&\!\!\! + \mathbb {E}_{\tau _E \sim \mathcal {B}} [\log {(1 - D_\phi (s, a))}] - \lambda \mathcal {H} (\pi _\theta ) \end{aligned}\end{aligned}$$
(3)

where \(\tau _\pi \), \(\tau _E\) are the trajectories sampled from the policy \(\pi _\theta \) and the buffer \(\mathcal {B}\) respectively, and \(\lambda \mathcal {H} (\pi _\theta )\) is the entropy regularization term. The discriminator \(D_\phi \) is updated via

$$\begin{aligned} \nabla _\phi \mathcal {L}^{\mathrm {GASIL}} = \mathbb {E}_{\tau _\pi } [\nabla _\phi \log {D_\phi (s, a)}] + \mathbb {E}_{\tau _E} [\nabla _\phi \log {(1 - D_\phi (s, a))}] \end{aligned}$$
(4)

and the policy \(\pi _\theta \) is updated via the approximate gradient

$$\begin{aligned} \begin{aligned}&\nabla _\theta \mathcal {L}^{\mathrm {GASIL}} = \mathbb {E}_{\tau _\pi } [\nabla _\theta \log {\pi _\theta (a | s)} \, Q(s, a)] - \lambda \nabla _\theta \mathcal {H} (\pi _\theta ), \\&\text {where} \,\, Q(s_t, a_t) = - \mathbb {E}_{\tau _\pi } [\sum _{t' = t}^\infty \gamma ^{t' - t} \log {D_\phi (s_{t'}, a_{t'})}]. \end{aligned}\end{aligned}$$
(5)

Eq. (5) has a similar form to the policy gradient (1), and thus the two can be combined as follows:

$$\begin{aligned} \nabla _\theta \mathcal {L}^{\mathrm {PG}} + \alpha \nabla _\theta \mathcal {L}^{\mathrm {GASIL}} =\mathbb {E}_{\tau _\pi } [\nabla _\theta \log {\pi _\theta (a | s)} \, \hat{A}^\alpha _t] + \lambda \nabla _\theta \mathcal {H} (\pi _\theta ) \end{aligned}$$
(6)

where \(\hat{A}^\alpha _t\) is the advantage estimate obtained by replacing \(r(s, a)\) with the modified reward \(r(s, a) - \alpha \log {D_\phi (s, a)}\). The extra term \(- \log {D_\phi (s, a)}\) can be seen as an intrinsic reward signal generated by the discriminator to encourage the agent to imitate past behaviors.
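
To make the imitation bonus concrete, the following is a minimal PyTorch sketch (not the authors' code) of a discriminator and the reward term \(-\log {D_\phi (s, a)}\); the network sizes and interfaces are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """D_phi(s, a) in (0, 1); architecture sizes are illustrative."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def imitation_bonus(disc: Discriminator, obs: torch.Tensor, act: torch.Tensor,
                    eps: float = 1e-8) -> torch.Tensor:
    """r_im = -log D_phi(s, a).

    Under the convention of Eq. (3), D is trained to score policy samples high and
    buffer samples low, so the bonus is larger for transitions that resemble the
    stored good trajectories."""
    with torch.no_grad():
        return -torch.log(disc(obs, act) + eps)
```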

3.3 Random Network Distillation

RND introduces a fixed and randomly initialized target network \(f: \mathcal {S} \rightarrow \mathbb {R}^k\), which takes a state as input and outputs a k-dimensional embedding, together with a predictor network \(\hat{f}: \mathcal {S} \rightarrow \mathbb {R}^k\), which is trained to minimize the MSE \({\Vert }{\hat{f}(s; \theta ) - f(s)}{\Vert }^2\). The exploration bonus is defined as the prediction error \({\Vert }{\hat{f}(s) - f(s)}{\Vert }^2\), which is expected to be lower for the states similar to the frequently visited ones.
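
A minimal sketch of the two RND networks and the bonus is shown below; the embedding architecture is an illustrative assumption rather than the exact one used in our experiments.

```python
import torch
import torch.nn as nn

def _embed_net(obs_dim: int, k: int) -> nn.Module:
    # Small MLP mapping a state to a k-dimensional embedding; sizes are illustrative.
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, k))

class RND(nn.Module):
    def __init__(self, obs_dim: int, k: int = 64):
        super().__init__()
        self.target = _embed_net(obs_dim, k)     # f: fixed, randomly initialized
        self.predictor = _embed_net(obs_dim, k)  # f_hat: trained to match the target
        for p in self.target.parameters():
            p.requires_grad_(False)

    def bonus(self, obs: torch.Tensor) -> torch.Tensor:
        """Per-state prediction error ||f_hat(s) - f(s)||^2, used as the exploration bonus."""
        return (self.predictor(obs) - self.target(obs)).pow(2).sum(dim=-1)

    def loss(self, obs: torch.Tensor) -> torch.Tensor:
        """MSE minimized w.r.t. the predictor only; the target has gradients disabled."""
        return self.bonus(obs).mean()
```

When the bonus is consumed as a reward it should be detached from the computation graph (e.g., evaluated under `torch.no_grad()`), while the same quantity with gradients enabled serves as the predictor's training loss.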

4 Explore-then-Exploit Framework

Our EE framework incorporates an exploration bonus component into the GASIL structure. We choose RND as the exploration algorithm because of its simplicity of implementation. Both GASIL and RND generate extra reward signals, which allows us to formulate our framework as an interpolation of rewards from three different sources: (a) the external environment reward \(r^{\mathrm {ext}}\), (b) the imitation bonus \(r^{\mathrm {im}} = - \log {D_\phi (s, a)}\) derived from the discriminator, and (c) the exploration bonus \(r^{\mathrm {exp}} = {\Vert }{\hat{f}(s) - f(s)}{\Vert }^2\) given by the predictor network. Directly summing up these rewards is problematic, since the imitation bonus and the exploration bonus guide the agent in different directions. Instead, we use a dynamically adaptive reward

$$\begin{aligned} r = (1 - \nu ) \, r^{\mathrm {im}} + \nu \, (r^{\mathrm {ext}} + r^{\mathrm {exp}}) \end{aligned}$$
(7)

where \(\nu \) controls the ratio between exploration and self-imitation. The parameter \(\nu \) is explicitly set to 0 or 1 to prevent the two terms from interfering with each other. In the exploration stage, we set \(\nu = 1\) to completely eliminate the effect of self-imitation, which allows the agent to freely explore the environment. In the self-imitation stage, we set \(\nu = 0\) so that the agent relies purely on the imitation bonus. As a result, the agent quickly converges to a local optimum and begins to explore again when it switches back to the exploration stage.

Fig. 1. Square wave with period \(10^6\) and duty cycle 25%.

Fig. 2. Heuristic method which sets \(\nu = 1\) for the first \(3 \times 10^5\) steps and \(\nu = 0\) for the remaining time.

It is crucial to determine when to set the ratio \(\nu \) to 0 (the self-imitation stage) or to 1 (the exploration stage). In this work, we introduce a heuristic method that works well on the underlying benchmarks: we assign \(\nu \) as a square wave with period T and duty cycle d. A particular setting, \(T = 10^6\) and \(d = 25\%\), is used throughout the paper, as shown in Fig. 1. Another method is to keep the agent in the exploration stage for a certain number of steps at the beginning and switch to the self-imitation stage for the remaining time, as shown in Fig. 2. We found that the latter often leads to superior performance. Minimal sketches of both schedules are given below.
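
Both schedules reduce to simple functions of the environment step; the following sketch mirrors the settings above.

```python
def nu_interval(step: int, period: int = 1_000_000, duty: float = 0.25) -> int:
    """Square-wave schedule of Fig. 1: nu = 1 (explore) for the first `duty` fraction
    of each period, nu = 0 (self-imitate) for the rest."""
    return 1 if (step % period) < duty * period else 0

def nu_first_exp(step: int, explore_steps: int = 300_000) -> int:
    """Schedule of Fig. 2: explore for the first `explore_steps` steps, then self-imitate."""
    return 1 if step < explore_steps else 0
```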

Our framework only involves the reward interpolation, and thus can be plugged into any actor-critic algorithm such as A2C [11] or PPO [16]. We demonstrate the combination of our method with PPO in Algorithm 1. Note that the rewards used to rank the trajectories stored in the replay buffer do not include the exploration bonuses; more specifically, the total trajectory reward is defined as \(R = \sum _{t = 0} ^\infty r_t^{\mathrm {ext}}\).

Algorithm 1. Explore-then-Exploit combined with PPO.
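
Since the full listing is not reproduced here, the following is only a rough sketch of how the interpolated reward of Eq. (7) could be computed during rollout collection. `policy.act`, `disc`, `rnd`, and `nu_schedule` are hypothetical interfaces (see the sketches above), the classic Gym step API is assumed, and the PPO, GASIL, and RND updates themselves are omitted.

```python
import torch

def collect_rollout(env, policy, disc, rnd, nu_schedule, global_step, horizon=2048):
    """Collect one rollout using the EE reward of Eq. (7)."""
    obs = env.reset()
    transitions = []
    for t in range(horizon):
        nu = nu_schedule(global_step + t)
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        action = policy.act(obs_t)                              # hypothetical policy interface
        next_obs, r_ext, done, _ = env.step(action.detach().numpy())
        with torch.no_grad():
            r_im = -torch.log(disc(obs_t, action) + 1e-8)       # imitation bonus
            r_exp = rnd.bonus(obs_t.unsqueeze(0)).squeeze(0)    # exploration bonus
        r = (1 - nu) * r_im + nu * (r_ext + r_exp)              # Eq. (7)
        # r_ext is stored separately: trajectory ranking in the buffer uses external reward only.
        transitions.append((obs, action, float(r), float(r_ext), done))
        obs = env.reset() if done else next_obs
    return transitions
```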

5 Experiments and Results

The experiments are designed to answer the following questions:

  1. Is EE better than running RND or GASIL alone?

  2. Is the RND exploration bonus itself necessary, or does random exploration work just as well?

  3. Is the interleaving schedule critical?

5.1 Implementation Details

We evaluated our method on several OpenAI Gym [3] MuJoCo continuous control tasks. The specifications of the environments are listed in Table 1. All benchmarks were modified into episodic-reward environments: rather than providing the per-timestep reward \(r_t\), we provide the whole episode reward \(R = \sum _{t = 0}^\infty r_t\) at the last step of an episode and zero reward at every other step. A sketch of such a wrapper is shown below.
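
For concreteness, the modification can be implemented as a thin Gym wrapper; the sketch below assumes the classic 4-tuple step API and is not necessarily the exact wrapper used in our experiments.

```python
import gym

class EpisodicRewardWrapper(gym.Wrapper):
    """Return zero reward at every step and the accumulated episode return at termination."""

    def __init__(self, env):
        super().__init__(env)
        self._episode_return = 0.0

    def reset(self, **kwargs):
        self._episode_return = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._episode_return += reward
        if done:
            return obs, self._episode_return, done, info
        return obs, 0.0, done, info
```

For example, `EpisodicRewardWrapper(gym.make("Hopper-v2"))` turns the standard Hopper task into its episodic-reward counterpart.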

Table 1. State and action space of OpenAI Gym MuJoCo tasks

We implemented the following agents on top of an existing PPO implementation [17]:

  • PPO: The proximal policy optimization [16] baseline.

  • PPO + RND: PPO combined with RND bonus [5].

  • PPO + GASIL: PPO combined with GASIL [7].

  • EE_interval: Our method where \(\nu \) is assigned to be the square wave with period \(10^6\) and duty cycle \(25\%\).

  • EE_first_exp: Our method where \(\nu \) is assigned to be 1 for the first \(3 \times 10^5\) steps and to be 0 for the remaining steps.

The hyperparameters used in our experiments are shown in Table 2. Every feed-forward network, including the actor-critic network, the discriminator, and the predictor, has 2 hidden layers with 64 neurons each. Note that the parameter \(\alpha \) is only used in PPO + GASIL, not in EE.

5.2 Episodic MuJoCo

We first evaluated the 5 types of agents on 6 MuJoCo tasks. The results in Fig. 3 show that EE performs better than all baselines on Walker2d, Hopper, and HalfCheetah, and performs comparably with GASIL on Swimmer. This is because the Swimmer task is simple enough that exploration is not even necessary. However, on more complicated tasks such as Walker2d and Hopper, the benefit of integrating exploration and self-imitation is significant. On Ant and Humanoid, all 5 agents fail to learn a meaningful policy, mainly due to the high dimensionality of the observation space, which makes the networks difficult to train.

Table 2. EE hyperparameters on MuJoCo.

In Fig. 3, we see that EE_interval has an obvious performance drop when the agent is in the exploration stage and begins to climb again when it switches back to the self-imitation stage. This is the expected behavior, since the agent tends to select unpredictable actions, which potentially causes early termination of an episode. Unfortunately, EE_interval performs slightly worse than EE_first_exp, which seems to indicate that the interleaving schedule has no advantage over the non-interleaving one. One possible reason is that the MuJoCo environments lack a sequentially dependent structure: reliably producing certain rewards does not make it easier to obtain the subsequent rewards. In this case, it is not beneficial to first converge to a good policy and then explore from that point.

Fig. 3. Learning curves for PPO, RND, GASIL, and our method EE with two different schedules on 6 OpenAI MuJoCo tasks with episodic rewards. Mean and standard deviation over 5 random seeds are plotted.

Fig. 4. Learning curves for EE with 3 different exploration behaviors on 3 MuJoCo tasks. Mean and standard deviation over 5 random seeds are plotted.

5.3 Effect of RND

In Fig. 3, it is notable that adding the RND bonus alone has no noticeable effect, which raises the question of the effectiveness of RND. We carried out another experiment to investigate this phenomenon, modifying the behavior of the EE_first_exp agent in the exploration stage as follows: no exp removes the RND bonus from the reward interpolation, so the agent relies only on the external reward \(r^{\mathrm {ext}}\); random exp makes the agent always take a random action; EE_first_exp remains unchanged. The results in Fig. 4 show that integrating GASIL with RND indeed amplifies the effect of exploration compared to purely random exploration.

Fig. 5. Learning curves for agents with two types of reward interpolation trained on 3 MuJoCo tasks. Mean and standard deviation over 5 random seeds are plotted.

5.4 Direct Interpolation

To justify the claim that directly mixing the imitation bonus and the exploration bonus results in poor performance, we modified the reward interpolation to \(r = \alpha \, r^{\mathrm {im}} + (1 - \alpha )\, r^{\mathrm {ext}} + \beta \, r^{\mathrm {exp}}\), where \(\alpha \) was fixed at 0.8 throughout the experiment and \(\beta \) was set to \(\{1, 2, 4, 8, 16\}\) respectively. This agent is referred to as direct. Figure 5 shows that the direct method with \(\beta = 1\) performs much the same as GASIL, which indicates that the imitation bonus is likely to dominate the resulting policy when the two are given similar weights. Furthermore, the performance drops when higher weights are placed on the exploration bonus. This result again demonstrates that mixing the two behaviors brings about inferior performance.

6 Conclusion

In this paper, we proposed Explore-then-Exploit (EE), a novel framework that combines GASIL and RND in the form of a reward interpolation, and provided heuristic ways to interleave between the exploration and self-imitation stages. We demonstrated that EE significantly improves over existing single-agent methods on several continuous control tasks with episodic rewards. We also empirically justified our hypothesis that separating the objectives of exploration and imitation is better than mixing them together. Developing appropriate ways to automatically adjust the ratio between exploration and imitation is an important direction for future work. Further, we plan to apply our framework to more complicated environments such as Atari.