XRL-DINE provides insights into the decision-making of Online Deep RL for adaptive systems. The main pieces of information that XRL-DINE provides to explainees are so-called DINEs. DINEs allow explainees to understand the causes of RL's decisions and thereby provide explanations of why an adaptive system performed its adaptations.
3.1 Baseline Techniques and Limitations
XRL-DINE combines Reward Decomposition [23] and Interestingness Elements [58] in such a way as to balance their respective limitations as follows.
Reward Decomposition. Originally proposed to improve learning performance, Reward Decomposition was exploited in [23] for the sake of explainability. Reward Decomposition splits the reward function \(\mathcal{R}(S,A)\) of the RL agent into \(k\) sub-functions \(\mathcal{R}_{1}(S,A),\ldots,\mathcal{R}_{k}(S,A)\), called reward channels, each of which reflects a different aspect of the learning goal. For each of the reward sub-functions \(\mathcal{R}_{i}(S,A)\), a separate RL agent, which we call sub-agent, is trained and thus learns its own value-function \(Q_{i}(S,A)\). To select a concrete action \(A\) in state \(S\), an aggregated value-function \(Q(S,A)\) is computed by accumulating the action-values proposed by the different reward channels, i.e., \(Q(S,A)=\sum_{i=1}^{k}Q_{i}(S,A)\). The resulting aggregated value-function \(Q(S,A)\) is then used for action selection, while tradeoffs in decision-making made by the composed agent become observable via the individual reward channels.
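To make the aggregation step concrete, the following minimal sketch (our own illustration, not the authors' implementation) sums the action-values of two hypothetical sub-agents into an aggregated value-function used for greedy action selection; all values are made up.

```python
import numpy as np

# Hypothetical action-values Q_i(S, A) of k = 2 sub-agents (reward channels)
# for a fixed state S and 5 discrete actions (adaptations).
q_channel_1 = np.array([4.2, 3.9, 4.0, 3.7, 3.8])   # e.g., performance goal
q_channel_2 = np.array([0.3, 0.9, 0.2, 0.8, 0.1])   # e.g., cost goal

# Aggregated value function: Q(S, A) = sum_i Q_i(S, A).
q_aggregated = q_channel_1 + q_channel_2

# During exploitation, the composed agent picks the action with the highest
# aggregated action-value; the per-channel values remain available to
# explain which goal dominated this choice.
chosen_action = int(np.argmax(q_aggregated))
print("Aggregated Q(S, .):", q_aggregated)
print("Chosen action:", chosen_action)
```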
When applied to adaptive systems, Reward Decomposition is especially helpful for the typical problem of adapting a system while taking into account multiple quality goals. Each of these quality goals can be expressed as a reward sub-function. Explanations derived from Reward Decomposition help to understand which goal a chosen adaptation contributes to. However, Reward Decomposition provides no indication of an explanation's relevance; instead, it requires manually selecting the relevant time steps for which an RL decision should be explained. In particular, when RL decisions are made at run time, as is the case for Online RL for self-adaptive systems, observing all time steps and their explanations to identify the relevant ones can introduce significant cognitive overhead for explainees.
Interestingness Elements. The Interestingness Elements technique aims to facilitate the understanding of where the RL agent's capabilities and limitations lie [58]. This is achieved by extracting interesting agent-environment interactions along a trajectory of time steps. XRL-DINE leverages two specific kinds of Interestingness Elements: “(un)certain executions” and “minima and maxima.”
(Un)certain executions: This Interestingness Element categorizes RL decisions according to whether the RL agent is certain or uncertain in its decision. The general intuition is that if an RL agent is uncertain in its decision in state \(S\), it will often perform many different actions when repeatedly faced with state \(S\), whereas if it is certain, it always performs the same or similar actions. Phrased differently, an RL agent is considered certain in its decision in state \(S\) if it is easy to predict the RL agent's action \(A\).
To determine the uncertainty of a decision for a state \(S\), the evenness of the probability distribution over actions \(A\in\mathcal{A}\) is calculated. The probability distribution is approximated as \(\hat{\pi}(S,A)=\frac{n(S,A)}{n(S)}\), where, considering the observed trajectory of time steps, \(n(S)\) is the number of times the RL agent was faced with state \(S\), and \(n(S,A)\) is the number of times it executed action \(A\) after observing \(S\). The evenness \(e(S)\), and thus the uncertainty for state \(S\), is then computed as the normalized entropy of this distribution: \(e(S)=-\frac{1}{\ln|\mathcal{A}|}\sum_{A\in\mathcal{A}}\hat{\pi}(S,A)\,\ln\hat{\pi}(S,A)\).
An evenness of \(e(S)=1\) indicates maximum uncertainty, while \(e(S)\) close to zero indicates certainty.
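As an illustration of how such an uncertainty estimate could be computed from observed interactions, the sketch below derives \(\hat{\pi}(S,A)\) from state-action counts and measures evenness as Shannon entropy normalized by \(\ln|\mathcal{A}|\); the trajectory data and this particular normalization are assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical observed (state, action) pairs from a trajectory of time steps.
trajectory = [("S1", "A1"), ("S1", "A2"), ("S1", "A1"), ("S1", "A3"),
              ("S2", "A1"), ("S2", "A1"), ("S2", "A1")]

def evenness(state, trajectory, actions=("A1", "A2", "A3")):
    """Estimate pi_hat(S, A) = n(S, A) / n(S) and return its evenness e(S),
    computed here as Shannon entropy normalized by ln|A| (1 = uniform,
    close to 0 = strongly peaked)."""
    counts = Counter(a for s, a in trajectory if s == state)
    n_s = sum(counts.values())
    if n_s == 0:
        return 0.0
    entropy = 0.0
    for a in actions:
        p = counts[a] / n_s
        if p > 0:
            entropy -= p * math.log(p)
    return entropy / math.log(len(actions))

print("e(S1) =", round(evenness("S1", trajectory), 3))  # actions spread out: uncertain
print("e(S2) =", round(evenness("S2", trajectory), 3))  # always the same action: certain
```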
Minima and maxima: This Interestingness Element identifies where RL decisions led to a maximum or minimum reward, thereby helping to identify favorable and adverse situations for the RL agent. Maxima elements help understand how well the RL agent may perform with respect to its learning goal. Minima elements help understand how well the RL agent handles difficult situations.
To determine minima and maxima, an estimate of the value function \(V(S)\) is employed. \(V(S)\) indicates the maximum value (expected reward) that can be achieved when taking a decision in state \(S\) and is computed from the action-value function \(Q(S,A)\) as \(V(S)=\max_{A\in\mathcal{A}}Q(S,A)\).
States constituting local minima \(\mathcal{S}_{\mathrm{min}}\) and local maxima \(\mathcal{S}_{\mathrm{max}}\) are determined from \(V(S)\) and the values of the expected next states \(\mathcal{S}^{\prime}\), which are derived from the estimated transition probabilities. Here, \(\hat{\mathbb{P}}(S^{\prime}|S,A)\) is the probability of observing \(S^{\prime}\) when executing action \(A\) in state \(S\) and is estimated as \(\hat{\mathbb{P}}(S^{\prime}|S,A)=\frac{n(S,A,S^{\prime})}{n(S,A)}\), where, considering the observed trajectory of time steps, \(n(S,A,S^{\prime})\) is the number of times state \(S^{\prime}\) was visited after the RL agent executed action \(A\) in state \(S\).
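The following minimal sketch illustrates these ingredients under our own assumptions: \(V(S)\) is taken as the maximum action-value, transition probabilities are estimated from observed counts, and a simplified extremum test compares \(V(S)\) against the values of likely successor states. The tables and the test itself are illustrative, not the authors' implementation.

```python
import numpy as np
from collections import Counter

# Hypothetical action-value table Q(S, A) for 3 states and 2 actions.
Q = {"S1": np.array([1.0, 0.4]),
     "S2": np.array([2.5, 2.1]),
     "S3": np.array([0.2, 0.3])}

def V(state):
    # V(S) = max_A Q(S, A): best value achievable when deciding in S.
    return float(np.max(Q[state]))

# Frequency-based estimate of the transition probability:
# P_hat(S'|S,A) = n(S, A, S') / n(S, A), from observed transitions.
transitions = [("S1", 0, "S2"), ("S1", 0, "S2"), ("S1", 0, "S3"), ("S1", 1, "S3")]
n_sas = Counter((s, a, s2) for s, a, s2 in transitions)
n_sa = Counter((s, a) for s, a, _ in transitions)

def p_hat(s, a, s2):
    return n_sas[(s, a, s2)] / n_sa[(s, a)] if n_sa[(s, a)] else 0.0

# Simplified extremum check (illustrative assumption): S is a local maximum
# if its value is at least as high as that of every likely successor state,
# and a local minimum if it is at most as low.
def is_local_extremum(s, successors):
    values = [V(s2) for s2 in successors]
    return V(s) >= max(values), V(s) <= min(values)

print("P_hat(S2 | S1, a0) =", p_hat("S1", 0, "S2"))
print("S1 is (max, min):", is_local_extremum("S1", ["S2", "S3"]))
```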
When applied to adaptive systems, Interestingness Elements thus help reveal the RL agent's confidence in its decisions and can thereby aid in debugging the reward function. As mentioned above, RL does not completely eliminate manual development effort, as it requires defining a suitable reward function that helps the adaptive system learn which adaptation to perform in which state. Accordingly, rewards quantify whether a chosen adaptation was a good one or not. If the RL agent is uncertain in a given state or only achieves rather low maxima, this may point to a problem with how the reward function was defined; e.g., it may be that one needs to provide stronger rewards to guide the learning process towards more effective adaptations [36, 69].
3.2 “Reward Channel Dominance” DINE
Building on Reward Decomposition, “Reward Channel Dominance” DINEs provide information on how each sub-agent would influence the choice of each possible action \(A\in\mathcal{A}\) in the given state \(S\). “Reward Channel Dominance” DINEs thereby help understand why a concrete adaptation was chosen in a given state, and why another possible adaptation was not chosen. We introduce two types of “Reward Channel Dominance” DINEs:
Absolute Reward Channel Dominance follows the original approach of Reward Decomposition and gives the action-values \(Q_{k}(S,A)\) for each action \(A\in\mathcal{A}\) of each reward channel \(k\) for a given state \(S\in\mathcal{S}\). Recall that the action-values give the expected cumulative reward when executing action \(A\) in state \(S\) (see Section 2.1) and thereby determine which adaptation is chosen during exploitation in state \(S\). Figure 5(a) shows an example of how this type of DINE is visualized in the XRL-DINE dashboard. The action-values of the different reward channels are stacked for each possible action \(A\in\mathcal{A}\), with the height of the bar indicating the aggregated action-value.
Since each of the sub-agents has its own reward function \(R_{k}\) and thus receives different kinds of rewards, the action-values of the different sub-agents may have different ranges (e.g., even including negative values). This makes comparing these action-values, and thus understanding the reason for the aggregated RL decision, difficult. As an example, in Figure 5(a) it appears that reward channel 1 dominates the decision for all of Actions 1–5. Especially when considering that in many typical adaptive systems adaptations are discrete (e.g., offering different feature combinations of the system; see [38]), a small difference in action-values may have a profound impact on the adaptive system, as quite a different adaptation is selected among the possible discrete adaptations.
Relative Reward Channel Dominance helps to better understand the individual contributions of the reward channels to the overall decision by converting the absolute action-values \(Q_{k}(S,A)\) into relative action-values \(\tilde{Q}_{k}(S,A)\). To compute these relative action-values, for each reward channel \(k\), the action-value of the worst-performing action \(A_{k,0}=\mathrm{argmin}_{A\in\mathcal{A}}Q_{k}(S,A)\) is subtracted from the action-values of all actions, i.e., \(\tilde{Q}_{k}(S,A)=Q_{k}(S,A)-Q_{k}(S,A_{k,0})\).
Figure 5(b) shows an example of this type of DINE. It is clearly visible that Action 1 is chosen by the aggregated agent, as it has the highest relative aggregated reward (i.e., the tallest bar). Here, reward channel 1 contributed most to the aggregated decision. Note the different scale (y-axis) of Figure 5(b) due to the use of relative action-values \(\tilde{Q}(S,A)\), which helps focus on the decisive contributions to the aggregated decision. While the same insights could be derived from Figure 5(a), the distinction is more difficult to make there, as the absolute action-values \(Q(S,A)\) lie much closer to each other.
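As a minimal sketch of this conversion (with made-up action-values rather than those of Figure 5), the following code subtracts each channel's worst action-value from its action-values and reports which channel dominates the aggregated choice.

```python
import numpy as np

# Hypothetical absolute action-values Q_k(S, A) per reward channel (rows)
# for 5 possible adaptations (columns) in a fixed state S.
Q = np.array([
    [4.2, 3.9, 4.0, 3.7, 3.8],   # reward channel 1
    [0.3, 0.9, 0.2, 0.8, 0.1],   # reward channel 2
])

# Relative action-values: per channel, subtract the action-value of the
# worst-performing action, so that the decisive differences stand out.
Q_rel = Q - Q.min(axis=1, keepdims=True)

# The aggregated agent still picks the action with the highest summed value.
chosen = int(np.argmax(Q.sum(axis=0)))
print("Relative action-values:\n", Q_rel)
print("Chosen action:", chosen, "| dominant channel:",
      int(np.argmax(Q_rel[:, chosen])) + 1)
```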
3.3 “Uncertain Action” DINE
Applying the original Interestingness Elements approach directly to value-based Deep RL may return misleading results. After convergence of the learning process (e.g., achieved via \(\epsilon\)-decay; see Section 2.1), the action-value function \(Q(S,A)\) remains more or less stable, as the RL agent will mainly perform exploitation. During this exploitation, for a given state \(S^{*}\), always the same action \(A^{*}\) (the one with the highest action-value) will be chosen. As a result, any given state \(S^{*}\) would be considered certain, as the estimated probability distribution would be very uneven, peaking only at \(A^{*}\). Thereby, even if the other action-values differ only marginally from the action-value of \(A^{*}\), which intuitively would be considered an uncertain situation, the original approach would not identify it as uncertain.
To overcome this weakness, XRL-DINE calculates “Uncertain Action” DINEs using normalized action-values \(\hat{Q}(S,A)\) instead of the estimated probability distribution \(\hat{\pi}(S,A)\). Normalized action-values are computed by first making all action-values positive and then normalizing these positive action-values. The evenness calculation follows the approach proposed by Sequeira and Gervasio [58] but uses the normalized action-values instead, where \(\mathcal{A}^{+}_{S}=\{a \mid Q(S,a) > 0, a\in\mathcal{A}\}\) denotes the set of actions considered. To tune the number of DINEs generated, we introduce the hyper-parameter
\(\rho\) (also see Section 3.6). This hyper-parameter prescribes how much evenness is required to consider a state uncertain. A state \(S\) is considered uncertain if \(e(S)\geq\rho\).
To combine this with Reward Decomposition, the relative action-values are calculated for each reward channel. An “Uncertain Action” is identified only if a relevant evenness is found for at least one of the reward channels and, in addition, the action of the aggregated RL agent does not correspond to the action that this sub-agent would choose.
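The check just described could look roughly as follows; this is a sketch under our own assumptions (action-values shifted to be non-negative and scaled to sum to one, entropy-based evenness), not the tool's actual code.

```python
import numpy as np

RHO = 0.9  # hyper-parameter: required evenness to consider a state uncertain

def normalized_evenness(q_values):
    """Evenness over normalized action-values (assumed normalization:
    shift to non-negative, then scale to sum to one)."""
    q_pos = q_values - q_values.min()
    if q_pos.sum() == 0:
        return 1.0  # all values equal: maximally uncertain
    q_norm = q_pos / q_pos.sum()
    nonzero = q_norm[q_norm > 0]
    entropy = -(nonzero * np.log(nonzero)).sum()
    return entropy / np.log(len(q_values))

def uncertain_action_dine(Q_channels):
    """Q_channels: per-channel action-values Q_k(S, .) as a 2D array.
    Returns True if this state yields an 'Uncertain Action' DINE."""
    aggregated_action = int(np.argmax(Q_channels.sum(axis=0)))
    for q_k in Q_channels:
        diverging = int(np.argmax(q_k)) != aggregated_action
        if normalized_evenness(q_k) >= RHO and diverging:
            return True
    return False

Q = np.array([[4.2, 3.9, 4.0, 3.7, 3.8],
              [0.3, 0.9, 0.2, 0.8, 0.1]])
print("Uncertain action?", uncertain_action_dine(Q))
```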
Figure 6 shows an example of the visualization of these DINEs. As can be seen, the first two RL decisions at the beginning of the trace are certain, with the RL agent deciding on exactly one action. This is followed by two uncertain actions, with one or even two alternative actions that differ only marginally in their expected reward from the action actually taken. For the selected action we use the color beige. The colors of the alternative actions match the colors defined for the respective reward channel that dominates the selection of the alternative action (see Figure 5).
3.4 “Contrastive Action” DINE
In addition to the graphical visualization, the information from the “Uncertain Action” DINEs and “Reward Channel Dominance” DINEs can be combined to generate a “Contrastive Action” DINE. A “Contrastive Action” DINE provides a contrastive explanation in which the action that would have been chosen by the sub-agent in isolation represents the contrastive action. XRL-DINE provides natural-language text for such an explanation following the template below:
To reach the goal <contrastive reward channel>, I should actually choose action <contrastive action>. However, it is currently more important to choose action <action chosen by aggregated RL agent> to achieve the goal <reward channel that dominated aggregated decision>.
For example, a “Contrastive Action” DINE could be:
To reach the goal of Reward Channel 2, I should actually choose Action 1. However, it is currently more important to choose Action 4 to achieve the goal Reward Channel 1.
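Generating such text is essentially template filling; the sketch below shows one hypothetical way to do it (function and parameter names are ours, not XRL-DINE's API).

```python
# Template for a "Contrastive Action" DINE, following the text above.
TEMPLATE = ("To reach the goal {contrastive_channel}, I should actually choose "
            "action {contrastive_action}. However, it is currently more important "
            "to choose action {chosen_action} to achieve the goal {dominant_channel}.")

def contrastive_action_dine(contrastive_channel, contrastive_action,
                            chosen_action, dominant_channel):
    # Fill the template with the contrastive and the actually chosen action.
    return TEMPLATE.format(contrastive_channel=contrastive_channel,
                           contrastive_action=contrastive_action,
                           chosen_action=chosen_action,
                           dominant_channel=dominant_channel)

print(contrastive_action_dine("Reward Channel 2", "Action 1",
                              "Action 4", "Reward Channel 1"))
```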
3.5 “Reward Channel Extremum” DINE
This type of DINE is based on the “minima and maxima” Interestingness Elements. By combining them with Reward Decomposition, “Reward Channel Extremum” DINEs provide insight into the individual sub-agents' trajectories of decisions and into which actions the RL sub-agents take to leave a local reward minimum as quickly as possible or to maintain a local reward maximum.
The original Interestingness Elements approach envisioned computing minima and maxima based on a past trajectory of RL agent interactions. However, as we aim to explain Online RL for adaptive systems, we expand the original notion to be capable of computing minima and maxima at run time. To determine whether the current state \(S\) may be considered a minimum or maximum, we thus need to predict the next state \(S^{\prime}\) for each possible action \(A\in\mathcal{A}\). The value-based Deep RL approaches we focus on in this paper (see Section 2) belong to the class of model-free RL algorithms. This means that they do not use a model of the environment, which otherwise could be used to predict the next state \(S^{\prime}\). We thus suggest approximating an environment model to be used by XRL-DINE. As such an approximation depends on the concretely chosen RL algorithm, we discuss it further as part of our proof-of-concept implementation in Section 4.
To generate “Reward Channel Extremum” DINEs, the next states \(\hat{\mathcal{S}}^{\prime}\) for all possible actions \(A\in\mathcal{A}\) are predicted from the current state \(S\) using the approximated environment model. Since local extrema may occur quite often, XRL-DINE offers the hyper-parameter \(\phi\) to control the number of DINEs generated (also see Section 3.6). Accordingly, states constituting local minima \(\mathcal{S}_{k,\mathrm{min}}\) and local maxima \(\mathcal{S}_{k,\mathrm{max}}\) are determined per reward channel \(k\) by comparing the value \(V_{k}(S)\) of the current state with the values of the predicted next states \(\hat{\mathcal{S}}^{\prime}\), where the value function \(V_{k}(S)\) of a sub-agent is computed analogously to \(V(S)\), i.e., \(V_{k}(S)=\max_{A\in\mathcal{A}}Q_{k}(S,A)\).
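To illustrate the idea, the following sketch uses a hypothetical one-step environment model and the per-channel value function \(V_{k}(S)=\max_{A}Q_{k}(S,A)\); the way \(\phi\) enters as a margin on the value comparison is our own simplifying assumption.

```python
import numpy as np

PHI = 0.5  # hyper-parameter controlling how pronounced an extremum must be

# Hypothetical per-channel action-value function Q_k(S, A), keyed by state.
Q_k = {"S": np.array([2.0, 1.2, 0.8]),
       "S_a0": np.array([1.1, 0.9, 0.7]),
       "S_a1": np.array([1.4, 1.3, 1.0]),
       "S_a2": np.array([0.6, 0.5, 0.4])}

def predict_next_state(state, action):
    # Hypothetical approximated environment model: maps (S, A) to the
    # predicted next state. A real implementation depends on the concretely
    # chosen RL algorithm (see Section 4).
    return f"{state}_a{action}"

def V(state):
    # Sub-agent value function: V_k(S) = max_A Q_k(S, A).
    return float(np.max(Q_k[state]))

def extremum(state, actions=(0, 1, 2)):
    """Classify the current state as a local maximum/minimum of this reward
    channel if its value exceeds (falls below) the values of all predicted
    next states by at least PHI (illustrative use of the hyper-parameter)."""
    next_values = [V(predict_next_state(state, a)) for a in actions]
    if V(state) >= max(next_values) + PHI:
        return "local maximum"
    if V(state) <= min(next_values) - PHI:
        return "local minimum"
    return None

print("State S is a:", extremum("S"))
```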
Figure 7 shows an example of the visualization of these DINEs. As can be seen, reward channel 1 rather quickly reaches a local maximum when compared to the other reward channels. Yet, also rather quickly, the reward in channel 1 drops—even below the observed previous local minimum—suggesting an important change in the RL agent's environment, which has not yet been sufficiently learned by reward channel 1.
3.6 XRL-DINE Hyper-Parameters
As introduced above, XRL-DINE offers two hyper-parameters to tune the explanation generation process. Here, we indicate how the setting of these hyper-parameters influences the number of DINEs generated, and we experimentally assess their influence.
— \(\rho\in[0,1]\): The lower this hyper-parameter is set, the more “Uncertain Action” DINEs will be generated. In contrast, for higher values of this hyper-parameter, none or only a few, but possibly more relevant, DINEs are generated. For example, if longer traces of RL agent interactions should be explained, it can help to select a higher value for \(\rho\), as this keeps the number of DINEs, and thus the cognitive load on the explainee, manageable.
— \(\phi > 0\): The frequency and number of “Reward Channel Extremum” DINEs can be controlled by \(\phi\). In situations where local extrema occur rather often, selecting a higher \(\phi\) reduces the number of DINEs generated; a minimal sketch of applying both hyper-parameters is shown below.
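The sketch below indicates how the two hyper-parameters could be exposed and adjusted at run time; the class, field, and function names are hypothetical and not part of XRL-DINE's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class DineConfig:
    """Hypothetical container for the two XRL-DINE hyper-parameters."""
    rho: float = 0.9   # evenness threshold for "Uncertain Action" DINEs
    phi: float = 0.5   # extremum margin for "Reward Channel Extremum" DINEs

def should_emit_uncertain_action(evenness: float, cfg: DineConfig) -> bool:
    # Higher rho -> fewer, but potentially more relevant, DINEs.
    return evenness >= cfg.rho

def should_emit_extremum(value_gap: float, cfg: DineConfig) -> bool:
    # Higher phi -> only more pronounced extrema generate DINEs.
    return value_gap >= cfg.phi

config = DineConfig()
# Hyper-parameters can be adjusted at run time, e.g., to switch from
# coarse-grained observation to in-depth debugging.
config.rho = 0.7
print(should_emit_uncertain_action(0.8, config))
```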
To illustrate how the setting of these hyper-parameters may impact the number of DINEs generated, Figure 8 shows concrete results for the adaptive system exemplar used in our study (see Section 5). We measure how the number of “Uncertain Action” DINEs depends on \(\rho\), and how the number of “Reward Channel Extremum” DINEs depends on \(\phi\). The measurements cover a total of 62,000 time steps. To facilitate comparability of results, we use data from a single run of the RL agent and filter it according to the different hyper-parameter settings.
As can be seen in the charts, the hyper-parameters allow tuning the rate of generated DINEs over a wide range, e.g., from close to zero up to 100% in the case of \(\rho\). Thereby, the XRL-DINE hyper-parameters help address different explanation needs of explainees, e.g., coarse-grained observation vs. in-depth debugging. Note that these hyper-parameters can even be changed at run time, allowing explainees to dynamically tune the rate of DINEs.