XRL-DINE provides insights into the decision-making of Online Deep RL for adaptive systems. The main pieces of information that XRL-DINE provides to explainees are so-called DINEs. DINEs allow explainees to understand the causes of RL's decisions and thereby provide explanations of why an adaptive system performed its adaptations.
3.1 Baseline Techniques and Limitations
XRL-DINE combines Reward Decomposition [23] and Interestingness Elements [58] in such a way as to balance their respective limitations as follows.
Reward Decomposition. Originally proposed to improve learning performance, Reward Decomposition was exploited in [23] for the sake of explainability. Reward Decomposition splits the reward function \(\mathcal{R}(S,A)\) of the RL agent into \(k\) sub-functions \(\mathcal{R}_{1}(S,A),\ldots,\mathcal{R}_{k}(S,A)\), called reward channels, each of which reflects a different aspect of the learning goal. For each of the reward sub-functions \(\mathcal{R}_{i}(S,A)\), a separate RL agent, which we call sub-agent, is trained and thus learns its own value-function \(Q_{i}(S,A)\). To select a concrete action \(A\) in state \(S\), an aggregated value-function \(Q(S,A)\) is computed by accumulating the action-values proposed by the different reward channels, i.e., \(Q(S,A)=\sum_{i=1}^{k}Q_{i}(S,A)\). The resulting aggregated value-function \(Q(S,A)\) is then used for action selection, while tradeoffs in decision-making made by the composed agent become observable via the individual reward channels.
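To make the aggregation step concrete, the following minimal sketch (our own illustration, not the authors' implementation) sums the action-values of two hypothetical sub-agents into an aggregated value-function used for greedy action selection; all values are made up.

```python
import numpy as np

# Hypothetical action-values Q_i(S, A) of k = 2 sub-agents (reward channels)
# for a fixed state S and 5 discrete actions (adaptations).
q_channel_1 = np.array([4.2, 3.9, 4.0, 3.7, 3.8])   # e.g., performance goal
q_channel_2 = np.array([0.3, 0.9, 0.2, 0.8, 0.1])   # e.g., cost goal

# Aggregated value function: Q(S, A) = sum_i Q_i(S, A).
q_aggregated = q_channel_1 + q_channel_2

# During exploitation, the composed agent picks the action with the highest
# aggregated action-value; the per-channel values remain available to
# explain which goal dominated this choice.
chosen_action = int(np.argmax(q_aggregated))
print("Aggregated Q(S, .):", q_aggregated)
print("Chosen action:", chosen_action)
```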
When applied to adaptive systems, Reward Decomposition is especially helpful for the typical problem of adapting a system while taking into account multiple quality goals. Each of these quality goals can be expressed as a reward sub-function. Explanations derived from Reward Decomposition help to understand which goal a chosen adaptation contributes to. However, Reward Decomposition provides no indication of an explanation's relevance; instead, it requires manually selecting the relevant time steps for which an RL decision should be explained. In particular, when RL decisions are made at run time, as is the case for Online RL for self-adaptive systems, observing all time steps and their explanations to identify the relevant ones can introduce significant cognitive overhead for explainees.
Interestingness Elements. The Interestingness Elements technique aims to facilitate the understanding of where the RL agent's capabilities and limitations lie [58]. This is achieved by extracting interesting agent-environment interactions along a trajectory of time steps. XRL-DINE leverages two specific kinds of Interestingness Elements: “(un)certain executions” and “minima and maxima.”
(Un)certain executions: This Interestingness Element categorizes RL decisions according to whether the RL agent is certain or uncertain in its decision. The general intuition is that if an RL agent is uncertain in its decision in state \(S\), it will often perform many different actions when repeatedly faced with state \(S\), whereas if it is certain, it always performs the same or similar actions. Phrased differently, an RL agent is considered certain in its decision in state \(S\) if it is easy to predict the RL agent's action \(A\).
To determine the uncertainty of a decision for a state \(S\), the evenness of the probability distribution over actions \(A\in\mathcal{A}\) is calculated. The probability distribution is approximated as \(\hat{\pi}(S,A)=\frac{n(S,A)}{n(S)}\), where, considering the observed trajectory of time steps, \(n(S)\) is the number of times the RL agent was faced with state \(S\), and \(n(S,A)\) is the number of times it executed action \(A\) after observing \(S\). The evenness \(e(S)\), and thus the uncertainty for state \(S\), is then computed as the normalized entropy of this distribution: \(e(S)=-\frac{1}{\ln|\mathcal{A}|}\sum_{A\in\mathcal{A}}\hat{\pi}(S,A)\,\ln\hat{\pi}(S,A)\).
An evenness of \(e(S)=1\) indicates maximum uncertainty, while \(e(S)\) close to zero indicates certainty.
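As an illustration of how such an uncertainty estimate could be computed from observed interactions, the sketch below derives \(\hat{\pi}(S,A)\) from state-action counts and measures evenness as Shannon entropy normalized by \(\ln|\mathcal{A}|\); the trajectory data and this particular normalization are assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical observed (state, action) pairs from a trajectory of time steps.
trajectory = [("S1", "A1"), ("S1", "A2"), ("S1", "A1"), ("S1", "A3"),
              ("S2", "A1"), ("S2", "A1"), ("S2", "A1")]

def evenness(state, trajectory, actions=("A1", "A2", "A3")):
    """Estimate pi_hat(S, A) = n(S, A) / n(S) and return its evenness e(S),
    computed here as Shannon entropy normalized by ln|A| (1 = uniform,
    close to 0 = strongly peaked)."""
    counts = Counter(a for s, a in trajectory if s == state)
    n_s = sum(counts.values())
    if n_s == 0:
        return 0.0
    entropy = 0.0
    for a in actions:
        p = counts[a] / n_s
        if p > 0:
            entropy -= p * math.log(p)
    return entropy / math.log(len(actions))

print("e(S1) =", round(evenness("S1", trajectory), 3))  # actions spread out: uncertain
print("e(S2) =", round(evenness("S2", trajectory), 3))  # always the same action: certain
```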
Minima and maxima: This Interestingness Element identifies where RL decisions led to a maximum or minimum reward, thereby helping to identify favorable and adverse situations for the RL agent. Maxima elements help understand how well the RL agent may perform with respect to its learning goal. Minima elements help understand how well the RL agent handles difficult situations.
To determine minima and maxima, an estimate of the value function \(V(S)\) is employed. \(V(S)\) indicates the maximum value (expected reward) that can be achieved when taking a decision in state \(S\) and is computed from the action-value function \(Q(S,A)\) as \(V(S)=\max_{A\in\mathcal{A}}Q(S,A)\).
States constituting local minima \(\mathcal{S}_{\mathrm{min}}\) and local maxima \(\mathcal{S}_{\mathrm{max}}\) are determined from \(V(S)\) and the values of the expected next states \(\mathcal{S}^{\prime}\), which are derived from the estimated transition probabilities. Here, \(\hat{\mathbb{P}}(S^{\prime}|S,A)\) is the probability of observing \(S^{\prime}\) when executing action \(A\) in state \(S\) and is estimated as \(\hat{\mathbb{P}}(S^{\prime}|S,A)=\frac{n(S,A,S^{\prime})}{n(S,A)}\), where, considering the observed trajectory of time steps, \(n(S,A,S^{\prime})\) is the number of times state \(S^{\prime}\) was visited after the RL agent executed action \(A\) in state \(S\).
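The following minimal sketch illustrates these ingredients under our own assumptions: \(V(S)\) is taken as the maximum action-value, transition probabilities are estimated from observed counts, and a simplified extremum test compares \(V(S)\) against the values of likely successor states. The tables and the test itself are illustrative, not the authors' implementation.

```python
import numpy as np
from collections import Counter

# Hypothetical action-value table Q(S, A) for 3 states and 2 actions.
Q = {"S1": np.array([1.0, 0.4]),
     "S2": np.array([2.5, 2.1]),
     "S3": np.array([0.2, 0.3])}

def V(state):
    # V(S) = max_A Q(S, A): best value achievable when deciding in S.
    return float(np.max(Q[state]))

# Frequency-based estimate of the transition probability:
# P_hat(S'|S,A) = n(S, A, S') / n(S, A), from observed transitions.
transitions = [("S1", 0, "S2"), ("S1", 0, "S2"), ("S1", 0, "S3"), ("S1", 1, "S3")]
n_sas = Counter((s, a, s2) for s, a, s2 in transitions)
n_sa = Counter((s, a) for s, a, _ in transitions)

def p_hat(s, a, s2):
    return n_sas[(s, a, s2)] / n_sa[(s, a)] if n_sa[(s, a)] else 0.0

# Simplified extremum check (illustrative assumption): S is a local maximum
# if its value is at least as high as that of every likely successor state,
# and a local minimum if it is at most as low.
def is_local_extremum(s, successors):
    values = [V(s2) for s2 in successors]
    return V(s) >= max(values), V(s) <= min(values)

print("P_hat(S2 | S1, a0) =", p_hat("S1", 0, "S2"))
print("S1 is (max, min):", is_local_extremum("S1", ["S2", "S3"]))
```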
When applied to adaptive systems, Interestingness Elements thus help reveal the RL agent's confidence in its decisions and can thereby aid in debugging the reward function. As mentioned above, RL does not completely eliminate manual development effort, as it requires defining a suitable reward function that helps the adaptive system learn which adaptation to perform in which state. Accordingly, rewards quantify whether a chosen adaptation was a good one or not. If the RL agent is uncertain in a given state or only achieves rather low maxima, this may point to a problem with how the reward function was defined; e.g., it may be that one needs to provide stronger rewards to guide the learning process towards more effective adaptations [36, 69].
3.2 “Reward Channel Dominance” DINE
Building on Reward Decomposition, “Reward Channel Dominance” DINEs provide information on how each sub-agent would influence the choice of each possible action \(A\in\mathcal{A}\) in the given state \(S\). “Reward Channel Dominance” DINEs thereby help understand why a concrete adaptation was chosen in a given state, and why another possible adaptation was not chosen. We introduce two types of “Reward Channel Dominance” DINEs:
Absolute Reward Channel Dominance follows the original approach of Reward Decomposition and gives the action-values \(Q_{k}(S,A)\) for each action \(A\in\mathcal{A}\) of each reward channel \(k\) for a given state \(S\in\mathcal{S}\). Recall that the action-values give the expected cumulative reward when executing action \(A\) in state \(S\) (see Section 2.1) and thereby determine which adaptation is chosen during exploitation in state \(S\). Figure 5(a) shows an example of how this type of DINE is visualized in the XRL-DINE dashboard. The action-values of the different reward channels are stacked for each possible action \(A\in\mathcal{A}\), with the height of the bar indicating the aggregated action-value.
Since each of the sub-agents has its own reward function \(R_{k}\) and thus receives different kinds of rewards, the action-values of the different sub-agents may have different ranges (e.g., even including negative values). This makes comparing these action-values, and thus understanding the reason for the aggregated RL decision, difficult. As an example, in Figure 5(a) it appears that reward channel 1 dominates the decision for all of Actions 1–5. Especially when considering that in many typical adaptive systems adaptations are discrete (e.g., offering different feature combinations of the system; see [38]), a small difference in action-values may have a profound impact on the adaptive system, as quite a different adaptation is selected among the possible discrete adaptations.
Relative Reward Channel Dominance helps to better understand the individual contributions of the reward channels to the overall decision by converting the absolute action-values \(Q_{k}(S,A)\) into relative action-values \(\tilde{Q}_{k}(S,A)\). To compute these relative action-values, for each reward channel \(k\), the action-value of the worst-performing action \(A_{k,0}=\mathrm{argmin}_{A\in\mathcal{A}}Q_{k}(S,A)\) is subtracted from the action-values of all actions, i.e., \(\tilde{Q}_{k}(S,A)=Q_{k}(S,A)-Q_{k}(S,A_{k,0})\).
Figure 5(b) shows an example of this type of DINE. It is clearly visible that Action 1 is chosen by the aggregated agent, as it has the highest relative aggregated reward (i.e., the tallest bar). Here, reward channel 1 contributed most to the aggregated decision. Note the different scale (y-axis) of Figure 5(b) due to the use of relative action-values \(\tilde{Q}(S,A)\), which helps focus on the decisive contributions to the aggregated decision. While the same insights could be derived from Figure 5(a), the distinction is more difficult to make there, as the absolute action-values \(Q(S,A)\) lie much closer to each other.
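As a minimal sketch of this conversion (with made-up action-values rather than those of Figure 5), the following code subtracts each channel's worst action-value from its action-values and reports which channel dominates the aggregated choice.

```python
import numpy as np

# Hypothetical absolute action-values Q_k(S, A) per reward channel (rows)
# for 5 possible adaptations (columns) in a fixed state S.
Q = np.array([
    [4.2, 3.9, 4.0, 3.7, 3.8],   # reward channel 1
    [0.3, 0.9, 0.2, 0.8, 0.1],   # reward channel 2
])

# Relative action-values: per channel, subtract the action-value of the
# worst-performing action, so that the decisive differences stand out.
Q_rel = Q - Q.min(axis=1, keepdims=True)

# The aggregated agent still picks the action with the highest summed value.
chosen = int(np.argmax(Q.sum(axis=0)))
print("Relative action-values:\n", Q_rel)
print("Chosen action:", chosen, "| dominant channel:",
      int(np.argmax(Q_rel[:, chosen])) + 1)
```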
3.3 “Uncertain Action” DINE
Applying the original Interestingness Elements approach directly to value-based Deep RL may return misleading results. After convergence of the learning process (e.g., achieved via \(\epsilon\)-decay; see Section 2.1), the action-value function \(Q(S,A)\) remains more or less stable, as the RL agent will mainly perform exploitation. During this exploitation, for a given state \(S^{*}\), always the same action \(A^{*}\) (the one with the highest action-value) will be chosen. As a result, any given state \(S^{*}\) would be considered certain, as the estimated probability distribution would be very uneven, peaking only at \(A^{*}\). Thereby, even if the other action-values differ only marginally from the action-value of \(A^{*}\), which intuitively would be considered an uncertain situation, the original approach would not identify it as uncertain.
To overcome this weakness, XRL-DINE calculates “Uncertain Action” DINEs using normalized action-values \(\hat{Q}(S,A)\) instead of the estimated probability distribution \(\hat{\pi}(S,A)\). Normalized action-values are computed by first making all action-values positive and then normalizing these positive action-values. The evenness calculation follows the approach proposed by Sequeira and Gervasio [58] but uses the normalized action-values instead, where \(\mathcal{A}^{+}_{S}=\{a \mid Q(S,a) > 0, a\in\mathcal{A}\}\) denotes the set of actions considered. To tune the number of DINEs generated, we introduce the hyper-parameter
\(\rho\) (also see Section 3.6). This hyper-parameter prescribes how much evenness is required to consider a state uncertain. A state \(S\) is considered uncertain if \(e(S)\geq\rho\).
To combine this with Reward Decomposition, the relative action-values are calculated for each reward channel. An “Uncertain Action” is identified only if a relevant evenness is found for at least one of the reward channels and, in addition, the action of the aggregated RL agent does not correspond to the action that this sub-agent would choose.
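The check just described could look roughly as follows; this is a sketch under our own assumptions (action-values shifted to be non-negative and scaled to sum to one, entropy-based evenness), not the tool's actual code.

```python
import numpy as np

RHO = 0.9  # hyper-parameter: required evenness to consider a state uncertain

def normalized_evenness(q_values):
    """Evenness over normalized action-values (assumed normalization:
    shift to non-negative, then scale to sum to one)."""
    q_pos = q_values - q_values.min()
    if q_pos.sum() == 0:
        return 1.0  # all values equal: maximally uncertain
    q_norm = q_pos / q_pos.sum()
    nonzero = q_norm[q_norm > 0]
    entropy = -(nonzero * np.log(nonzero)).sum()
    return entropy / np.log(len(q_values))

def uncertain_action_dine(Q_channels):
    """Q_channels: per-channel action-values Q_k(S, .) as a 2D array.
    Returns True if this state yields an 'Uncertain Action' DINE."""
    aggregated_action = int(np.argmax(Q_channels.sum(axis=0)))
    for q_k in Q_channels:
        diverging = int(np.argmax(q_k)) != aggregated_action
        if normalized_evenness(q_k) >= RHO and diverging:
            return True
    return False

Q = np.array([[4.2, 3.9, 4.0, 3.7, 3.8],
              [0.3, 0.9, 0.2, 0.8, 0.1]])
print("Uncertain action?", uncertain_action_dine(Q))
```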
Figure 6 shows an example of the visualization of these DINEs. As can be seen, the first two RL decisions at the beginning of the trace are certain, with the RL agent deciding on exactly one action. This is followed by two uncertain actions, with one or even two alternative actions that differ only marginally in their expected reward from the action actually taken. For the selected action we use the color beige. The colors of the alternative actions match the colors defined for the respective reward channel that dominates the selection of the alternative action (see Figure 5).
3.4 “Contrastive Action” DINE
In addition to the graphical visualization, the information from the “Uncertain Action” DINEs and “Reward Channel Dominance” DINEs can be combined to generate a “Contrastive Action” DINE. A “Contrastive Action” DINE provides a contrastive explanation in which the action that would have been chosen by the sub-agent in isolation represents the contrastive action. XRL-DINE provides natural-language text for such an explanation following the template below:
To reach the goal <contrastive reward channel>, I should actually choose action <contrastive action>. However, it is currently more important to choose action <action chosen by aggregated RL agent> to achieve the goal <reward channel that dominated aggregated decision>.
For example, a “Contrastive Action” DINE could be:
To reach the goal of Reward Channel 2, I should actually choose Action 1. However, it is currently more important to choose Action 4 to achieve the goal Reward Channel 1.
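Generating such text is essentially template filling; the sketch below shows one hypothetical way to do it (function and parameter names are ours, not XRL-DINE's API).

```python
# Template for a "Contrastive Action" DINE, following the text above.
TEMPLATE = ("To reach the goal {contrastive_channel}, I should actually choose "
            "action {contrastive_action}. However, it is currently more important "
            "to choose action {chosen_action} to achieve the goal {dominant_channel}.")

def contrastive_action_dine(contrastive_channel, contrastive_action,
                            chosen_action, dominant_channel):
    # Fill the template with the contrastive and the actually chosen action.
    return TEMPLATE.format(contrastive_channel=contrastive_channel,
                           contrastive_action=contrastive_action,
                           chosen_action=chosen_action,
                           dominant_channel=dominant_channel)

print(contrastive_action_dine("Reward Channel 2", "Action 1",
                              "Action 4", "Reward Channel 1"))
```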
3.5 “Reward Channel Extremum” DINE
This type of DINE is based on the “minima and maxima” Interestingness Elements. By combining them with Reward Decomposition, “Reward Channel Extremum” DINEs provide insight into the individual sub-agents' trajectories of decisions and into which actions the RL sub-agents take to leave a local reward minimum as quickly as possible or to maintain a local reward maximum.
The original Interestingness Elements approach envisioned computing minima and maxima based on a past trajectory of RL agent interactions. However, as we aim to explain Online RL for adaptive systems, we expand the original notion to be capable of computing minima and maxima at run time. To determine whether the current state \(S\) may be considered a minimum or maximum, we thus need to predict the next state \(S^{\prime}\) for each possible action \(A\in\mathcal{A}\). The value-based Deep RL approaches we focus on in this paper (see Section 2) belong to the class of model-free RL algorithms. This means that they do not use a model of the environment, which otherwise could be used to predict the next state \(S^{\prime}\). We thus suggest approximating an environment model to be used by XRL-DINE. As such an approximation depends on the concretely chosen RL algorithm, we discuss it further as part of our proof-of-concept implementation in Section 4.
To generate “Reward Channel Extremum” DINEs, the next states \(\hat{\mathcal{S}}^{\prime}\) for all possible actions \(A\in\mathcal{A}\) are predicted from the current state \(S\) using the approximated environment model. Since local extrema may occur quite often, XRL-DINE offers the hyper-parameter \(\phi\) to control the number of DINEs generated (also see Section 3.6). Accordingly, states constituting local minima \(\mathcal{S}_{k,\mathrm{min}}\) and local maxima \(\mathcal{S}_{k,\mathrm{max}}\) are determined per reward channel \(k\) by comparing the value \(V_{k}(S)\) of the current state with the values of the predicted next states \(\hat{\mathcal{S}}^{\prime}\), where the value function \(V_{k}(S)\) of a sub-agent is computed analogously to \(V(S)\), i.e., \(V_{k}(S)=\max_{A\in\mathcal{A}}Q_{k}(S,A)\).
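To illustrate the idea, the following sketch uses a hypothetical one-step environment model and the per-channel value function \(V_{k}(S)=\max_{A}Q_{k}(S,A)\); the way \(\phi\) enters as a margin on the value comparison is our own simplifying assumption.

```python
import numpy as np

PHI = 0.5  # hyper-parameter controlling how pronounced an extremum must be

# Hypothetical per-channel action-value function Q_k(S, A), keyed by state.
Q_k = {"S": np.array([2.0, 1.2, 0.8]),
       "S_a0": np.array([1.1, 0.9, 0.7]),
       "S_a1": np.array([1.4, 1.3, 1.0]),
       "S_a2": np.array([0.6, 0.5, 0.4])}

def predict_next_state(state, action):
    # Hypothetical approximated environment model: maps (S, A) to the
    # predicted next state. A real implementation depends on the concretely
    # chosen RL algorithm (see Section 4).
    return f"{state}_a{action}"

def V(state):
    # Sub-agent value function: V_k(S) = max_A Q_k(S, A).
    return float(np.max(Q_k[state]))

def extremum(state, actions=(0, 1, 2)):
    """Classify the current state as a local maximum/minimum of this reward
    channel if its value exceeds (falls below) the values of all predicted
    next states by at least PHI (illustrative use of the hyper-parameter)."""
    next_values = [V(predict_next_state(state, a)) for a in actions]
    if V(state) >= max(next_values) + PHI:
        return "local maximum"
    if V(state) <= min(next_values) - PHI:
        return "local minimum"
    return None

print("State S is a:", extremum("S"))
```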
Figure 7 shows an example of the visualization of these DINEs. As can be seen, reward channel 1 rather quickly reaches a local maximum when compared to the other reward channels. Yet, also rather quickly, the reward in channel 1 drops—even below the observed previous local minimum—suggesting an important change in the RL agent's environment, which has not yet been sufficiently learned by reward channel 1.
3.6 XRL-DINE Hyper-Parameters
As introduced above, XRL-DINE offers two hyper-parameters to tune the explanation generation process. Here, we indicate how the setting of these hyper-parameters influences the number of DINEs generated, and we experimentally assess their influence.
— \(\rho\in[0,1]\): The lower this hyper-parameter is set, the more “Uncertain Action” DINEs will be generated. In contrast, for higher values of this hyper-parameter, none or only a few, but possibly more relevant, DINEs are generated. For example, if longer traces of RL agent interactions should be explained, it can help to select a higher value for \(\rho\), as this keeps the number of DINEs, and thus the cognitive load on the explainee, manageable.
— \(\phi > 0\): The frequency and number of “Reward Channel Extremum” DINEs can be controlled by \(\phi\). In situations where local extrema occur rather often, selecting a higher \(\phi\) reduces the number of DINEs generated; a minimal sketch of applying both hyper-parameters is shown below.
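The sketch below indicates how the two hyper-parameters could be exposed and adjusted at run time; the class, field, and function names are hypothetical and not part of XRL-DINE's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class DineConfig:
    """Hypothetical container for the two XRL-DINE hyper-parameters."""
    rho: float = 0.9   # evenness threshold for "Uncertain Action" DINEs
    phi: float = 0.5   # extremum margin for "Reward Channel Extremum" DINEs

def should_emit_uncertain_action(evenness: float, cfg: DineConfig) -> bool:
    # Higher rho -> fewer, but potentially more relevant, DINEs.
    return evenness >= cfg.rho

def should_emit_extremum(value_gap: float, cfg: DineConfig) -> bool:
    # Higher phi -> only more pronounced extrema generate DINEs.
    return value_gap >= cfg.phi

config = DineConfig()
# Hyper-parameters can be adjusted at run time, e.g., to switch from
# coarse-grained observation to in-depth debugging.
config.rho = 0.7
print(should_emit_uncertain_action(0.8, config))
```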
To illustrate how the setting of these hyper-parameters may impact the number of DINEs generated, Figure 8 shows concrete results for the adaptive system exemplar used in our study (see Section 5). We measure how the number of “Uncertain Action” DINEs depends on \(\rho\), and how the number of “Reward Channel Extremum” DINEs depends on \(\phi\). The measurements cover a total of 62,000 time steps. To facilitate comparability of results, we use data from a single run of the RL agent and filter it according to the different hyper-parameter settings.
As can be seen in the charts, the hyper-parameters allow tuning the rate of generated DINEs over a wide range, e.g., from close to zero up to 100% in the case of \(\rho\). Thereby, the XRL-DINE hyper-parameters help address different explanation needs of explainees, e.g., coarse-grained observation vs. in-depth debugging. Note that these hyper-parameters can even be changed at run time, allowing explainees to dynamically tune the rate of DINEs.