
1 Introduction

Reinforcement learning (RL) methods learn an optimal policy by maximizing the expected return. These methods work well in environments with dense rewards but suffer from degenerate performance when rewards are sparse, i.e., nearly zero throughout an episode. In such cases, the agent requires both an efficient exploration method that guides it toward potentially useful experiences, and an efficient exploitation technique that makes better use of those experiences.

An exploration bonus is a simple form of directed exploration: it generates an intrinsic reward at every step even when external rewards are unavailable. The bonus is designed to be higher in novel states than in frequently visited ones, and thus encourages the agent to explore new behavior. However, such methods still require a tremendous amount of time to train, and the intrinsic rewards may vanish when the policy converges to a local optimum.

Self-imitation learning is a recently introduced algorithm that enhances exploitation by storing and reusing useful experiences. Whenever the agent happens to generate high-rewarding trajectories, they are collected and stored in a replay buffer. The policy is then trained to imitate the trajectories in the buffer, so the resulting agent consistently reproduces past good behavior. This method is shown to have high sample efficiency, but it hurts exploration and may get stuck at local optima. Moreover, self-imitation learning relies on other techniques to bootstrap the process before the first trajectory is added to the buffer.

In this paper, we propose Explore-then-Exploit (EE), a framework that combines random network distillation (RND) [5], a form of exploration bonus, with generative adversarial self-imitation learning (GASIL) [7]. By integrating the two methods, the RND bonus solves the initial bootstrap problem of self-imitation and potentially prevents the policy from getting stuck at local optima, while GASIL speeds up the convergence of the policy and provides good starting points for later exploration. Nevertheless, a direct composition is problematic: with exploration bonuses the agent prefers unpredictable actions, whereas with self-imitation it tends to reproduce past behavior, and mixing these two objectives confuses the agent and leads to undesired results. Instead, we suggest combining them in an interleaving manner, so that the agent learns only one concept at each stage and switches to the other when certain criteria are reached. We formulate our framework as a reward interpolation and provide heuristic methods to determine the weight between exploration and self-imitation. Finally, we evaluate our model on several MuJoCo tasks [19] with episodic rewards and empirically show that EE improves over GASIL or RND in most environments.

2 Related Work

Exploration. There has been much research on exploration in RL. Count-based exploration gives a reward to rarely visited states [2, 13]. Prediction-based exploration, also known as the curiosity-driven method, predicts the agent’s dynamics and treats the prediction error as an intrinsic reward [1, 4, 14, 18]. Random network distillation (RND) [5] improves upon these approaches by utilizing two neural networks: a randomly initialized target network, and a predictor network trained to minimize the difference between the outputs of the two networks. This difference is then used as the exploration bonus.

Self-imitation. Self-imitation learning (SIL) [12] was recently proposed to exploit past good behaviors. The algorithm stores transitions in a replay memory and uses them to update the policy when the stored return is higher than the current state-value estimate. Generative adversarial self-imitation learning (GASIL) [7] is a generative adversarial extension of SIL, which instead stores the top-k experiences ranked by episode return and formulates self-imitation as a divergence minimization problem. Combined with actor-critic methods, GASIL learns a shaped, dense reward function that can be used as an extra reward signal. Another method [6] also utilizes the generative adversarial structure, but trains an ensemble of agents and uses a special optimization technique to guarantee diversity among the agents. That work focuses on the interaction between multiple agents, whereas ours considers a single agent; the technique can be used alongside our framework without conflict.

Amplifying the Imitation Effect. In very recent work, AIE [10] was proposed to combine RND and SIL, which is similar in spirit to our method. That work combines the two algorithms directly and introduces several techniques to enhance the effect of RND, whereas ours integrates GASIL in an interleaving fashion and is evaluated on common RL benchmarks.

3 Background

3.1 Reinforcement Learning

Consider a state space \(\mathcal {S}\) and an action space \(\mathcal {A}\). The purpose of RL is to find a parameterized policy \(\pi _\theta (a|s)\), where \(s \in \mathcal {S}\) and \(a \in \mathcal {A}\), that maximizes the expected discounted sum of rewards \(\eta (\pi ) = \mathbb {E}_{\pi } [\sum _{t = 0}^{\infty } \gamma ^t r_t]\), where \(\gamma \in (0, 1]\) is the discount factor and \(r_t\) is the reward at time t. The objective is non-differentiable and hence requires policy gradient techniques to estimate the gradient. A commonly used gradient estimator of the objective \(\eta (\pi _\theta )\) is given by

$$\begin{aligned} \nabla _\theta \eta (\pi _\theta ) = \mathbb {E}_{\pi } [\nabla _\theta \log {\pi _\theta (a_t | s_t)} \hat{A}_t], \end{aligned}$$
(1)

where \(\hat{A}_t\) is the advantage estimation at time t. The estimator \(\nabla _\theta \eta (\pi _\theta )\) can be obtained by differentiating the surrogate objective

$$\begin{aligned} \mathcal {L}^{\mathrm {PG}}(\theta ) = \mathbb {E}_{\pi } [\log {\pi _\theta (a_t | s_t)} \hat{A}_t]. \end{aligned}$$
(2)
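
As a concrete illustration (not part of the original paper), Eq. (2) corresponds to a one-line loss in an automatic-differentiation framework. The sketch below assumes `log_probs` and `advantages` are tensors gathered from rollouts; minimizing the returned loss performs gradient ascent on the objective.

```python
import torch

def pg_surrogate_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Negative of the surrogate objective in Eq. (2).

    log_probs:  log pi_theta(a_t | s_t) for each sampled transition, shape (T,)
    advantages: advantage estimates A_hat_t, shape (T,)
    """
    # Advantages are treated as constants, so gradients flow only through log pi_theta.
    return -(log_probs * advantages.detach()).mean()
```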

3.2 Generative Adversarial Self-imitation Learning

GASIL involves a good trajectory buffer \(\mathcal {B} = \{\tau _i\}\) and a discriminator \(D_\phi (s, a)\). The buffer \(\mathcal {B}\) stores the top-k good trajectories according to the total trajectory reward \(R = \sum _{t = 0}^\infty r_t\), where each trajectory \(\tau _i\) consists of a sequence of states and actions \(\{s_0, a_0, s_1, a_1, \dots s_t, a_t\}\). The algorithm treats the trajectories stored in the buffer \(\mathcal {B}\) as the expert demonstrations, and utilizes the GAIL framework [8] to obtain a similar generative adversarial loss

$$\begin{aligned} \begin{aligned} \mathop {\text {argmin}}\limits _\theta \,\mathop {\text {argmax}}\limits _\phi \,\mathcal {L}^{\mathrm {GASIL}} (\theta , \phi ) = {}&\mathbb {E}_{\tau _\pi } [\log {D_\phi (s, a)}] \\&\!\!\! + \mathbb {E}_{\tau _E \sim \mathcal {B}} [\log {(1 - D_\phi (s, a))}] - \lambda \mathcal {H} (\pi _\theta ) \end{aligned}\end{aligned}$$
(3)

where \(\tau _\pi \), \(\tau _E\) are the trajectories sampled from the policy \(\pi _\theta \) and the buffer \(\mathcal {B}\) respectively, and \(\lambda \mathcal {H} (\pi _\theta )\) is the entropy regularization term. The discriminator \(D_\phi \) is updated via

$$\begin{aligned} \nabla _\phi \mathcal {L}^{\mathrm {GASIL}} = \mathbb {E}_{\tau _\pi } [\nabla _\phi \log {D_\phi (s, a)}] + \mathbb {E}_{\tau _E} [\nabla _\phi \log {(1 - D_\phi (s, a))}] \end{aligned}$$
(4)

and the policy \(\pi _\theta \) is updated via the approximate gradient

$$\begin{aligned} \begin{aligned}&\nabla _\theta \mathcal {L}^{\mathrm {GASIL}} = \mathbb {E}_{\tau _\pi } [\nabla _\theta \log {\pi _\theta (a | s)} \, Q(s, a)] - \lambda \nabla _\theta \mathcal {H} (\pi _\theta ), \\&\text {where} \,\, Q(s_t, a_t) = - \mathbb {E}_{\tau _\pi } [\sum _{t' = t}^\infty \gamma ^{t' - t} \log {D_\phi (s_{t'}, a_{t'})}]. \end{aligned}\end{aligned}$$
(5)

Eq. (5) has a similar form to the policy gradient (1), and thus the two can be combined as follows:

$$\begin{aligned} \nabla _\theta \mathcal {L}^{\mathrm {PG}} + \alpha \nabla _\theta \mathcal {L}^{\mathrm {GASIL}} =\mathbb {E}_{\tau _\pi } [\nabla _\theta \log {\pi _\theta (a | s)} \, \hat{A}^\alpha _t] + \lambda \nabla _\theta \mathcal {H} (\pi _\theta ) \end{aligned}$$
(6)

where \(\hat{A}^\alpha _t\) is the advantage estimate obtained by replacing \(r(s, a)\) with the modified reward \(r(s, a) - \alpha \log {D_\phi (s, a)}\). The extra term \(- \log {D_\phi (s, a)}\) can be seen as an intrinsic reward signal generated by the discriminator to encourage the agent to imitate past behaviors.
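
To make the imitation bonus concrete, the following is a minimal PyTorch sketch (not the authors' code) of a discriminator and the reward term \(-\log {D_\phi (s, a)}\); the network sizes and interfaces are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """D_phi(s, a) in (0, 1); architecture sizes are illustrative."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def imitation_bonus(disc: Discriminator, obs: torch.Tensor, act: torch.Tensor,
                    eps: float = 1e-8) -> torch.Tensor:
    """r_im = -log D_phi(s, a).

    Under the convention of Eq. (3), D is trained to score policy samples high and
    buffer samples low, so the bonus is larger for transitions that resemble the
    stored good trajectories."""
    with torch.no_grad():
        return -torch.log(disc(obs, act) + eps)
```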

3.3 Random Network Distillation

RND introduces a fixed and randomly initialized target network \(f: \mathcal {S} \rightarrow \mathbb {R}^k\), which takes a state as input and outputs a k-dimensional embedding, together with a predictor network \(\hat{f}: \mathcal {S} \rightarrow \mathbb {R}^k\), which is trained to minimize the MSE \({\Vert }{\hat{f}(s; \theta ) - f(s)}{\Vert }^2\). The exploration bonus is defined as the prediction error \({\Vert }{\hat{f}(s) - f(s)}{\Vert }^2\), which is expected to be lower for the states similar to the frequently visited ones.
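
A minimal sketch of the two RND networks and the bonus is shown below; the embedding architecture is an illustrative assumption rather than the exact one used in our experiments.

```python
import torch
import torch.nn as nn

def _embed_net(obs_dim: int, k: int) -> nn.Module:
    # Small MLP mapping a state to a k-dimensional embedding; sizes are illustrative.
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, k))

class RND(nn.Module):
    def __init__(self, obs_dim: int, k: int = 64):
        super().__init__()
        self.target = _embed_net(obs_dim, k)     # f: fixed, randomly initialized
        self.predictor = _embed_net(obs_dim, k)  # f_hat: trained to match the target
        for p in self.target.parameters():
            p.requires_grad_(False)

    def bonus(self, obs: torch.Tensor) -> torch.Tensor:
        """Per-state prediction error ||f_hat(s) - f(s)||^2, used as the exploration bonus."""
        return (self.predictor(obs) - self.target(obs)).pow(2).sum(dim=-1)

    def loss(self, obs: torch.Tensor) -> torch.Tensor:
        """MSE minimized w.r.t. the predictor only; the target has gradients disabled."""
        return self.bonus(obs).mean()
```

When the bonus is consumed as a reward it should be detached from the computation graph (e.g., evaluated under `torch.no_grad()`), while the same quantity with gradients enabled serves as the predictor's training loss.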

4 Explore-then-Exploit Framework

Our EE framework incorporates an exploration bonus component into the GASIL structure. We choose RND as the exploration algorithm because of its simplicity of implementation. Both GASIL and RND generate extra reward signals, which allows us to formulate our framework as an interpolation of rewards from three different sources: (a) the external environment reward \(r^{\mathrm {ext}}\), (b) the imitation bonus \(r^{\mathrm {im}} = - \log {D_\phi (s, a)}\) derived from the discriminator, and (c) the exploration bonus \(r^{\mathrm {exp}} = {\Vert }{\hat{f}(s) - f(s)}{\Vert }^2\) given by the predictor network. Directly summing up these rewards is problematic, since the imitation bonus and the exploration bonus guide the agent in different directions. Instead, we use a dynamically adaptive reward

$$\begin{aligned} r = (1 - \nu ) \, r^{\mathrm {im}} + \nu \, (r^{\mathrm {ext}} + r^{\mathrm {exp}}) \end{aligned}$$
(7)

where \(\nu \) controls the ratio between exploration and self-imitation. The parameter \(\nu \) is explicitly set to 0 or 1 to prevent the two terms from interfering with each other. In the exploration stage, we set \(\nu = 1\) to completely eliminate the effect of self-imitation, which allows the agent to freely explore the environment. In the self-imitation stage, we set \(\nu = 0\) so that the agent relies purely on the imitation bonus. As a result, the agent quickly converges to a local optimum and begins to explore again when it switches back to the exploration stage.

Fig. 1. Square wave with period \(10^6\) and duty cycle 25%.

Fig. 2. Heuristic method which sets \(\nu = 1\) for the first \(3 \times 10^5\) steps and \(\nu = 0\) for the remaining time.

It is crucial to determine when to set the ratio \(\nu \) to 0 (the self-imitation stage) or to 1 (the exploration stage). In this work, we introduce a heuristic method that works well on the underlying benchmarks: we assign \(\nu \) as a square wave with period T and duty cycle d. A particular setting, \(T = 10^6\) and \(d = 25\%\), is used throughout the paper, as shown in Fig. 1. Another method is to keep the agent in the exploration stage for a certain number of steps at the beginning and switch to the self-imitation stage for the remaining time, as shown in Fig. 2. We found that the latter often leads to superior performance. Minimal sketches of both schedules are given below.
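
Both schedules reduce to simple functions of the environment step; the following sketch mirrors the settings above.

```python
def nu_interval(step: int, period: int = 1_000_000, duty: float = 0.25) -> int:
    """Square-wave schedule of Fig. 1: nu = 1 (explore) for the first `duty` fraction
    of each period, nu = 0 (self-imitate) for the rest."""
    return 1 if (step % period) < duty * period else 0

def nu_first_exp(step: int, explore_steps: int = 300_000) -> int:
    """Schedule of Fig. 2: explore for the first `explore_steps` steps, then self-imitate."""
    return 1 if step < explore_steps else 0
```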

Our framework only involves the reward interpolation, and thus can be plugged into any actor-critic algorithm such as A2C [11] or PPO [16]. We demonstrate the combination of our method with PPO in Algorithm 1. Note that the rewards used to rank the trajectories stored in the replay buffer do not include the exploration bonuses; more specifically, the total trajectory reward is defined as \(R = \sum _{t = 0} ^\infty r_t^{\mathrm {ext}}\).

Algorithm 1. Explore-then-Exploit combined with PPO.
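
Since the full listing is not reproduced here, the following is only a rough sketch of how the interpolated reward of Eq. (7) could be computed during rollout collection. `policy.act`, `disc`, `rnd`, and `nu_schedule` are hypothetical interfaces (see the sketches above), the classic Gym step API is assumed, and the PPO, GASIL, and RND updates themselves are omitted.

```python
import torch

def collect_rollout(env, policy, disc, rnd, nu_schedule, global_step, horizon=2048):
    """Collect one rollout using the EE reward of Eq. (7)."""
    obs = env.reset()
    transitions = []
    for t in range(horizon):
        nu = nu_schedule(global_step + t)
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        action = policy.act(obs_t)                              # hypothetical policy interface
        next_obs, r_ext, done, _ = env.step(action.detach().numpy())
        with torch.no_grad():
            r_im = -torch.log(disc(obs_t, action) + 1e-8)       # imitation bonus
            r_exp = rnd.bonus(obs_t.unsqueeze(0)).squeeze(0)    # exploration bonus
        r = (1 - nu) * r_im + nu * (r_ext + r_exp)              # Eq. (7)
        # r_ext is stored separately: trajectory ranking in the buffer uses external reward only.
        transitions.append((obs, action, float(r), float(r_ext), done))
        obs = env.reset() if done else next_obs
    return transitions
```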

5 Experiments and Results

The experiments are designed to answer the following questions:

  1. Is EE better than running RND or GASIL alone?

  2. Is the RND exploration bonus itself necessary, or does random exploration work just as well?

  3. Is the interleaving schedule critical?

5.1 Implementation Details

We evaluated our method on several OpenAI Gym [3] MuJoCo continuous control tasks. The specifications of the environments are listed in Table 1. All benchmarks were modified into episodic-reward environments: rather than providing the per-timestep reward \(r_t\), we provide the whole episode reward \(R = \sum _{t = 0}^\infty r_t\) at the last step of an episode and zero reward at every other step. A sketch of such a wrapper is shown below.
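
For concreteness, the modification can be implemented as a thin Gym wrapper; the sketch below assumes the classic 4-tuple step API and is not necessarily the exact wrapper used in our experiments.

```python
import gym

class EpisodicRewardWrapper(gym.Wrapper):
    """Return zero reward at every step and the accumulated episode return at termination."""

    def __init__(self, env):
        super().__init__(env)
        self._episode_return = 0.0

    def reset(self, **kwargs):
        self._episode_return = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._episode_return += reward
        if done:
            return obs, self._episode_return, done, info
        return obs, 0.0, done, info
```

For example, `EpisodicRewardWrapper(gym.make("Hopper-v2"))` turns the standard Hopper task into its episodic-reward counterpart.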

Table 1. State and action space of OpenAI Gym MuJoCo tasks

We implemented the following agents on top of an existing PPO implementation [17]:

  • PPO: The proximal policy optimization [16] baseline.

  • PPO + RND: PPO combined with RND bonus [5].

  • PPO + GASIL: PPO combined with GASIL [7].

  • EE_interval: Our method where \(\nu \) is assigned to be the square wave with period \(10^6\) and duty cycle \(25\%\).

  • EE_first_exp: Our method where \(\nu \) is assigned to be 1 for the first \(3 \times 10^5\) steps and to be 0 for the remaining steps.

The hyperparameters used in our experiments are shown in Table 2. Every feed-forward network, including the actor-critic network, the discriminator, and the predictor, has 2 hidden layers with 64 neurons each. Note that the parameter \(\alpha \) is only used in PPO + GASIL, not in EE.

5.2 Episodic MuJoCo

We first evaluated the 5 types of agents on 6 MuJoCo tasks. The results in Fig. 3 show that EE performs better than all baselines on Walker2d, Hopper, and HalfCheetah, and performs comparably with GASIL on Swimmer. This is because the Swimmer task is simple enough that exploration is not even necessary. However, on more complicated tasks such as Walker2d and Hopper, the benefit of integrating exploration and self-imitation is significant. On Ant and Humanoid, all 5 agents fail to learn a meaningful policy, mainly due to the high dimensionality of the observation space, which makes the networks difficult to train.

Table 2. EE hyperparameters on MuJoCo.

In Fig. 3, we see that EE_interval has an obvious performance drop when the agent is in the exploration stage and begins to climb again when it switches back to the self-imitation stage. This is the expected behavior, since the agent tends to select unpredictable actions, which potentially causes early termination of an episode. Unfortunately, EE_interval performs slightly worse than EE_first_exp, which seems to indicate that the interleaving schedule has no advantage over the non-interleaving one. One possible reason is that the MuJoCo environments lack a sequentially dependent structure: reliably producing certain rewards does not make it easier to obtain the subsequent rewards. In this case, it is not beneficial to first converge to a good policy and then explore from that point.

Fig. 3. Learning curves for PPO, RND, GASIL, and our method EE with two different schedules on 6 OpenAI MuJoCo tasks with episodic rewards. Mean and standard deviation over 5 random seeds are plotted.

Fig. 4. Learning curves for EE with 3 different exploration behaviors on 3 MuJoCo tasks. Mean and standard deviation over 5 random seeds are plotted.

5.3 Effect of RND

In Fig. 3, it is notable that adding the RND bonus alone has no noticeable effect, which raises the question of the effectiveness of RND. We carried out another experiment to investigate this phenomenon, modifying the behavior of the EE_first_exp agent in the exploration stage as follows: no exp removes the RND bonus from the reward interpolation, so the agent relies only on the external reward \(r^{\mathrm {ext}}\); random exp makes the agent always take a random action; EE_first_exp remains unchanged. The results in Fig. 4 show that integrating GASIL with RND indeed amplifies the effect of exploration compared to purely random exploration.

Fig. 5. Learning curves for agents with two types of reward interpolation trained on 3 MuJoCo tasks. Mean and standard deviation over 5 random seeds are plotted.

5.4 Direct Interpolation

To justify the claim that directly mixing the imitation bonus and the exploration bonus results in poor performance, we modified the reward interpolation to \(r = \alpha \, r^{\mathrm {im}} + (1 - \alpha )\, r^{\mathrm {ext}} + \beta \, r^{\mathrm {exp}}\), where \(\alpha \) was fixed at 0.8 throughout the experiment and \(\beta \) was set to \(\{1, 2, 4, 8, 16\}\) respectively. This agent is referred to as direct. Figure 5 shows that the direct method with \(\beta = 1\) performs much the same as GASIL, which indicates that the imitation bonus is likely to dominate the resulting policy when the two are given similar weights. Furthermore, the performance drops when higher weights are placed on the exploration bonus. This result again demonstrates that mixing the two behaviors brings about inferior performance.

6 Conclusion

In this paper, we proposed Explore-then-Exploit (EE), a novel framework that combines GASIL and RND in the form of a reward interpolation, and provided heuristic ways to interleave between the exploration and self-imitation stages. We demonstrated that EE significantly improves over existing single-agent methods on several continuous control tasks with episodic rewards. We also empirically justified our hypothesis that separating the objectives of exploration and imitation is better than mixing them together. Developing appropriate ways to automatically adjust the ratio between exploration and imitation is an important direction for future work. Further, we plan to apply our framework to more complicated environments such as Atari.