1 Introduction

The multi-armed bandit (MAB) problem is a fundamental construct in online learning, where a learner has to quickly identify the best option (a.k.a., arm) among a given set of options. In the stochastic MAB problem, each arm is associated with an (a priori unknown) reward distribution, and a sample from this distribution is revealed each time an arm is chosen (a.k.a., pulled). The classical goal is to use these samples to quickly identify the arm with the highest mean reward. The most popular metric to evaluate the performance of a learning algorithm is regret, which is defined as a weighted sum of the expected number of plays of suboptimal arms, with the weights being equal to the suboptimality gaps.

While the classical stochastic MAB formulation has been applied in various application scenarios, including clinical trials, portfolio optimization, anomaly detection, and telecommunication (Bouneffouf & Rish, 2019), it ignores a key aspect of most real-world decision-making problems, namely that they have multiple criteria of interest. For example, when comparing testing kits in a clinical trial, one would want to keep track of the false-positive rate as well as the false-negative rate of each kit. Similarly, choosing the best financial portfolio involves balancing risk and reward. A wireless node deciding which channel to transmit on, or what transmission rate to use, has to balance several criteria, including throughput, delay, and energy consumption. This multi-criterion nature of decision making is not always adequately captured by the classical MAB approach of optimizing a one-dimensional reward signal.

The most common approach for incorporating multiple arm attributes into MAB formulations is to define the reward as a suitable function (say, a linear combination) of the attributes of interest. For example, risk-aware portfolio optimization can be cast as an MAB problem where the best arm is one that optimizes a certain linear combination of mean value and a suitable risk measure (such as standard deviation or Conditional Value at Risk); see, for example, Sani et al. (2012), Vakili and Zhao (2016), Kagrecha et al. (2019). However, this approach assumes that the different attributes of interest can be expressed and compared on a common scale, which is not always reasonable. For example, how does one ‘equate’ the impact of a certain increment in risk to that of an increment in the mean return of a portfolio?

A more natural approach for multi-criterion decision making is to instead pose the optimal choice as the solution of a constrained optimization problem; in other words, optimize one attribute subject to constraints on the others. However, despite the modest literature on multi-criterion MABs (surveyed below), little attention has been paid to formulating, and designing algorithms for, constrained multi-criterion MABs. This paper seeks to fill this gap. Formally, we assume that each arm is associated with a D-dimensional probability distribution, and the best arm is the one that minimizes a certain arm attribute subject to constraints on m other attributes. For this setting, we pursue regret minimization, i.e., we seek to minimize the average number of pulls of non-optimal arms.

To simplify the presentation, and to highlight the key aspects of this problem, we first consider a single constraint (i.e., \(m = 1\)). Each arm is associated with a probability distribution, and we consider the problem of optimizing an objective attribute, subject to a single constraint attribute satisfying a user-specified threshold. The algorithm design and performance evaluation for this special case (with \(m = 1\)) generalize readily to the multi-criterion problem; see Sect. 5. An example that fits this template, and is of significant interest in the finance community, is the optimization of the expected return subject to a risk constraint. Here, the objective attribute is the mean of an arm distribution, while the constraint attribute measuring risk could be Conditional Value at Risk (CVaR).
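
To make the CVaR example concrete, the snippet below sketches one common empirical estimator of CVaR at level \(\alpha\) (the average of the worst \((1-\alpha )\) fraction of loss samples); the function name and the conventions adopted (losses rather than rewards, the handling of the tail boundary) are illustrative assumptions rather than part of our formulation.

```python
import numpy as np

def empirical_cvar(samples, alpha=0.95):
    """One common empirical CVaR estimator for losses: the average of
    the worst ceil((1 - alpha) * n) samples. Conventions vary (losses
    vs. rewards, interpolation at the quantile); this is a sketch."""
    x = np.sort(np.asarray(samples, dtype=float))  # ascending order
    tail = max(1, int(np.ceil((1 - alpha) * x.size)))
    return float(x[-tail:].mean())  # mean of the (1 - alpha) worst losses
```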

For this problem, we propose an algorithm, called Constrained Lower Confidence Bound (Con-LCB), that guarantees logarithmic regret, i.e., the average number of plays of all non-optimal arms (including those that violate the constraint) is at most logarithmic in the horizon. If Con-LCB is presented with an infeasible instance, i.e., an instance where all arms violate the specified risk constraint, the algorithm in effect relaxes this constraint just enough to make at least one arm compliant. Another feature of Con-LCB is that at the end of the horizon, it outputs a boolean flag that correctly identifies, with high probability, whether or not the given instance was feasible.

Finally, we establish fundamental lower bounds on the performance of any algorithm on this constrained regret minimization problem. Our results demonstrate a fundamental tradeoff between regret minimization and feasibility identification, similar to the well-known tradeoff between regret minimization and best arm identification in the classical (unconstrained) MAB problem (Bubeck et al., 2009).

The remainder of this paper is organized as follows. A brief survey of related literature is provided below. In Sect. 2, we introduce some preliminaries and formulate the constrained mean minimization problem. We present our algorithm (Con-LCB) and its performance guarantees in Sect. 3. Information-theoretic lower bounds on performance are discussed in Sect. 4. Finally, the general formulation for multi-criterion MABs is introduced in Sect. 5.

Related literature The literature related to multi-armed bandit problems is quite large. We refer the reader to Bubeck and Cesa-Bianchi (2012), Lattimore and Szepesvári (2020) for a comprehensive review. Here, we restrict ourselves to papers that consider (i) multi-objective MAB problems with vector rewards, and (ii) risk-aware arm selection.

For multi-objective MAB problems with vector rewards, different notions of optimality are considered. For example, Drugan and Nowe (2013), Yahyaa and Manderick (2015) consider the notion of Pareto optimality. In these papers, all dimensions are considered equally important, and the aim is to play all Pareto-optimal arms an equal number of times. Another important notion of optimality is lexicographic optimality (see Ehrgott 2005), where there is an order of importance among the different dimensions. Tekin and Turğay (2018), Tekin (2019) consider lexicographic optimality for contextual bandit problems. In this line of work, the goal is to obtain a higher reward in the most important dimension, using rewards in the dimensions of lower importance for tie-breaking.

Turning now to the literature on risk-aware arm selection, Sani et al. (2012), Galichet et al. (2013), Vakili and Zhao (2016), David and Shimkin (2016), Bhat and Prashanth (2019), Prashanth et al. (2020), Kagrecha et al. (2019) consider the problem of optimizing a risk metric alone, or a linear combination of the mean and a risk metric. Zimin et al. (2014) look at the learnability of general functions of mean and variance, and Maillard (2013) proposes the logarithm of the moment generating function as a risk measure in a regret minimization framework. Cassel et al. (2018) look at path-dependent regret and provide a general approach to study many risk metrics.

None of the above papers considers the constrained MAB problem, which frames the optimal arm as the solution of a constrained optimization problem. There has been some recent work on the constrained MAB problem. A constrained linear bandit setting is considered in Pacchiano et al. (2021), under the assumption that there is at least one arm which satisfies the constraints. The papers (Amani et al., 2019; Moradipari et al., 2019) consider the problem of maximizing the reward subject to satisfying a linear constraint with high probability. The constrained setting was also considered in David et al. (2018) and Chang (2020). Both these papers consider a single constraint on arm selection; David et al. (2018) considers a constraint on VaR, and Chang (2020) considers an average cost constraint (each arm has a cost distribution that is independent of its reward distribution). All the papers above implicitly assume that the instance presented is feasible, whereas we address the issue of encountering an infeasible instance.

Finally, we note that constraints appear in an entirely different context in the family of bandit problems called “bandits with knapsacks.” Here, the goal is to maximize rewards under (vector) budget constraints, where the learning agent ceases to pull arms (and accumulate reward) once any one of the budget components is exhausted. The problem was introduced in Badanidiyuru et al. (2018), and some recent work in this space includes (Agrawal & Devanur, 2016; Sankararaman & Slivkins, 2021).

2 Problem formulation

In this section, we describe the formulation of our constrained stochastic MAB problem. To keep the exposition simple, we consider a single constraint (i.e., \(m = 1\)) for now; the general formulation with multiple constraints (i.e., \(m \ge 1\)) is discussed in Sect. 5. Informally, under the regret minimization objective considered here, the goal is to play, as often as possible, the arm that optimizes the objective, subject to a constraint on another attribute.

Formally, consider a multi-armed bandit problem with K arms, labeled \(1,2,\ldots ,K.\) Each arm is associated with a (possibly multi-dimensional) probability distribution, with \(\nu (k)\) denoting the (joint) distribution corresponding to arm \(k \in [K].\) Suppose that \(\nu (k) \in {\mathcal {C}},\) the space of possible arm distributions. The objective and the constraint attributes are defined via functions \(g_0\) and \(g_1\) respectively, mapping \({\mathcal {C}}\) to \(\mathbb {R}.\) Additionally, the user provides a threshold \(\tau \in \mathbb {R},\) which specifies an upper bound on the attribute \(g_1.\) An instance of the constrained MAB problem is defined by \((\nu ,\tau ),\) where \(\nu = (\nu (k),\ 1 \le k \le K) \in {\mathcal {C}}^K\) (here, \({\mathcal {C}}^K\) denotes the K-fold Cartesian product of \({\mathcal {C}}\)). The arms that satisfy the constraint \(g_1(\nu (\cdot )) \le \tau\) are called feasible arms; the set of feasible arms is denoted by \(\mathcal {K}(\nu ),\) and the set of arms that do not satisfy the constraint is denoted by \(\mathcal {K}(\nu )^c.\) The instance \((\nu ,\tau )\) is said to be feasible if \(\mathcal {K}(\nu ) \ne \emptyset ,\) and is said to be infeasible if \(\mathcal {K}(\nu ) = \emptyset .\)

Consider first a feasible instance. An optimal arm in this case is defined as an arm that minimizes \(g_0(\nu (\cdot )),\) subject to the constraint \(g_1(\nu (\cdot )) \le \tau\). Denote the optimal value as \(g_0^* = \min _{k \in \mathcal {K}(\nu )} g_0(\nu (k)).\) Arms which have \(g_0(\nu (\cdot ))\) larger than \(g_0^*\) (whether or not feasible) are referred to as suboptimal arms. Note that there can also exist infeasible arms with a smaller objective \(g_0\) than \(g_0^*.\) We refer to such arms as deceiver arms; the set of deceiver arms is denoted by \(\mathcal {K}^d(\nu ),\) where \(\mathcal {K}^d(\nu ) = \{k \in [K] : g_1(\nu (k)) > \tau ,~ g_0(\nu (k)) \le g_0^*\}.\) For a suboptimal arm k, the suboptimality gap is defined by \(\varDelta (k) := g_0(\nu (k)) - g_0^* > 0.\) (Note that the suboptimality gap is also defined for infeasible, non-deceiver arms.) Finally, for an infeasible arm k, the infeasibility gap is defined as \(\varDelta _{\tau }(k) = g_1(\nu (k)) - \tau > 0.\) Figure 1a provides a visual representation of a feasible instance.

Next, consider an infeasible instance. The optimal arm in this case is defined as the one with the smallest value of \(g_1(\nu (\cdot )).\) Let the optimal value be denoted by \(g_1^* = \min _{k \in [K]} g_{1}(\nu (k)) > \tau .\) This is equivalent to requiring that if the algorithm is faced with an infeasible instance, it must ‘relax’ the constraint just enough that at least one arm satisfies the constraint. The constraint gap for an arm k that is not optimal is defined as \(\varDelta _{\text {con}}(k) = g_1(\nu (k)) - g_1^* > 0.\) Figure 1b provides a visual representation of an infeasible instance.

For any (feasible or infeasible) instance \((\nu ,\tau ),\) let the set of optimal arms be denoted by \(\mathcal {K}^*(\nu ).\) The total number of pulls (or horizon) is denoted by T. For an algorithm (a.k.a., policy) \(\pi\), the number of pulls of an arm k over the first t pulls, for \(t \in [T],\) is denoted by \(N^{\pi }_{k}(t),\) though we often suppress the dependence on \(\pi\) for simplicity.

Fig. 1: The two types of instances (a: feasible; b: infeasible)

Consistency We now define the notion of consistency of an algorithm in this setting. A policy \(\pi\) is said to be consistent over a class of distributions \({\mathcal {C}},\) given a pre-specified constraint threshold \(\tau ,\) if for all instances \((\nu ,\tau )\) such that \(\nu \in {\mathcal {C}}^K,\) it holds that \(\mathbb {E}\left[ N_{k}(T)\right] = o(T^a)\) for all \(a > 0\) and for all (non-optimal) arms in \([K] \setminus \mathcal {K}^*(\nu ).\) This definition is in line with the definition of consistency in the classical unconstrained regret minimization setting (see Lai and Robbins 1985).

Regret The formal definition of regret in the present setting is as follows. For a feasible instance, there are two types of regret: suboptimality regret

$$\begin{aligned} R^{\textsf{sub}}_T := \sum _{k \in \mathcal {K}(\nu ) \setminus \mathcal {K}^*(\nu )} \varDelta (k) \mathbb {E}\left[ N_{k}(T)\right] , \end{aligned}$$

which is the regret caused due to the sampling of feasible, suboptimal arms, and infeasibility regret

$$\begin{aligned} R^{\textsf{inf}}_T := \sum _{k \in \mathcal {K}(\nu )^c} \varDelta _\tau (k) \mathbb {E}\left[ N_{k}(T)\right] , \end{aligned}$$

which is the regret caused due to the sampling of infeasible arms. For an infeasible instance, regret is caused by playing arms that are farther from the constraint boundary than the optimal arm. In this case, we define the constraint regret as

$$\begin{aligned} R^{\textsf{con}}_T := \sum _{k \in [K] \setminus \mathcal {K}^*(\nu )} \varDelta _{\text {con}}(k) \mathbb {E}\left[ N_{k}(T)\right] . \end{aligned}$$

(Note that our analysis essentially provides upper and lower bounds on \(\mathbb {E}\left[ N_{k}(T)\right]\) for all arms in \([K] \setminus \mathcal {K}^*(\nu ).\) So alternative definitions of regret, involving linear combinations of the expected number of pulls of non-optimal arms, are also supported by our analysis.)

Infeasibility identification Next, we introduce an optional boolean flag called feasibility_flag, which the policy may set at the end of T plays, to indicate post facto whether it considered the instance feasible (by setting the flag to true) or infeasible (by setting the flag to false). For the algorithms proposed in this paper, we provide bounds on the probability that this flag erroneously identifies a feasible instance as infeasible, and vice versa. We also provide fundamental lower bounds on the probability of error under any consistent policy.

Concentration inequalities on attribute estimators Naturally, MAB algorithms must estimate the attributes \(g_0\) and \(g_1\) corresponding to each arm using the data samples obtained by pulling that arm. We make the following assumption on the concentration properties of these estimators. Suppose that for \(i \in \{0,1\},\) and distribution \(F \in {\mathcal {C}},\) there exists an estimator \(\hat{g}_{i,n}(F)\) of \(g_i(F)\) using n i.i.d. samples from F, satisfying the following concentration inequality: There exists \(a_i > 0\) such that for all \(\varDelta > 0,\)

$$\begin{aligned} {\textsf{Pr}}\left( \left| \hat{g}_{i,n}(F) - g_i(F) \right| \ge \varDelta \right) \le 2 \exp (-a_i n \varDelta ^2). \end{aligned}$$
(1)

Concentration inequalities of this form are available in a broad variety of settings. For example, if \(g_i(F) = \mathbb {E}\left[ h_i(X)\right] ,\) where X is a random vector distributed as F,  then a concentration inequality of the form (1) is readily obtained if \(h_i(X)\) is bounded (using the Hoeffding inequality), or \(\sigma\)-subGaussian. Similarly, if \(h_i\) is Lipschitz and X is a subGaussian random vector, concentration bounds of the form (1) can be obtained by invoking the results in Kontorovich (2014). Several examples where risk measures can be concentrated in this manner are provided in Cassel et al. (2018). Also, note that the specific form (1) of the concentration inequality is only assumed to simplify the description of our algorithms. Alternative forms of concentration guarantees (such as those known for the means of subexponential or heavy-tailed distributions) can also be supported by our algorithmic framework via trivial modifications to the confidence bounds.
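
To make the assumption (1) concrete, inverting it at a confidence level \(\delta\) yields the radius of the confidence intervals used by our algorithms; the choice \(\delta = 1/T^2\) below, which produces the \(\log (2T^2)\) terms in our bounds, is shown purely for illustration:

$$\begin{aligned} {\textsf{Pr}}\left( \left| \hat{g}_{i,n}(F) - g_i(F) \right| \ge \sqrt{\frac{\log (2/\delta )}{a_i n}} \right) \le \delta , \quad \text {so that } \left| \hat{g}_{i,n}(F) - g_i(F) \right| \le \sqrt{\frac{\log (2T^2)}{a_i n}} \text { with probability at least } 1 - \frac{1}{T^2}. \end{aligned}$$

In particular, two attribute values separated by a gap \(\varDelta\) are reliably distinguished once \(2\sqrt{\log (2T^2)/(a_i n)} < \varDelta ,\) i.e., once \(n > 4 \log (2T^2)/(a_i \varDelta ^2),\) which is exactly the form of the pull-count bounds in Sect. 3.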

3 Constrained regret minimization

In this section, we present an algorithm for constrained regret minimization, and provide performance guarantees for the same, assuming that we have estimators for each attribute that satisfy the concentration inequality (1).

The algorithm, which we refer to as the constrained lower confidence bound (Con-LCB) algorithm (formal description presented as Algorithm 1), is based on the well-known principle of optimism under uncertainty. Con-LCB uses lower confidence bounds (LCBs) on attribute \(g_1\) of each arm to maintain a set of plausibly feasible arms, and uses LCBs for attribute \(g_0\) of the arms in this set to select the arm to be played. Note that LCBs are used for attribute \(g_0\) because we are dealing with minimization of \(g_0.\) If, at some instant, the set of plausibly feasible arms maintained by Con-LCB becomes empty, the algorithm turns conservative and plays the arm which violates the constraint least, i.e., the one with the smallest LCB on \(g_1\). Finally, at the end of T rounds, Con-LCB sets the feasibility flag to true if the set of plausibly feasible arms is found to be non-empty, and false otherwise.

Algorithm 1: Con-LCB
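
For concreteness, here is a minimal Python sketch of Con-LCB, assuming empirical-mean attribute estimators and the confidence radius \(\sqrt{\log (2T^2)/(a_i n)}\) derived from (1). The interface (each arm as a sampling callable returning a pair of attribute samples) is an illustrative assumption; Algorithm 1 remains the formal reference.

```python
import numpy as np

def con_lcb(arms, tau, T, a0=0.5, a1=0.5):
    """Minimal sketch of Con-LCB. `arms[k]()` returns one sample of the
    pair (objective attribute, constraint attribute) from arm k; using
    per-sample attribute values with empirical means is a simplifying
    assumption, standing in for generic estimators satisfying (1)."""
    K = len(arms)
    counts = np.zeros(K, dtype=int)
    sums = np.zeros((K, 2))  # running sums of both attribute samples

    def pull(k):
        sums[k] += np.asarray(arms[k](), dtype=float)
        counts[k] += 1

    for k in range(K):  # initialization: pull each arm once
        pull(k)

    for _ in range(K, T):
        means = sums / counts[:, None]
        rad0 = np.sqrt(np.log(2 * T**2) / (a0 * counts))
        rad1 = np.sqrt(np.log(2 * T**2) / (a1 * counts))
        lcb0, lcb1 = means[:, 0] - rad0, means[:, 1] - rad1
        plausible = np.flatnonzero(lcb1 <= tau)  # plausibly feasible set
        if plausible.size > 0:
            # optimistic g_0-minimization over the plausibly feasible set
            k = int(plausible[np.argmin(lcb0[plausible])])
        else:
            # conservative mode: play the least constraint-violating arm
            k = int(np.argmin(lcb1))
        pull(k)

    # feasibility_flag: is the plausibly feasible set non-empty at the end?
    lcb1 = sums[:, 1] / counts - np.sqrt(np.log(2 * T**2) / (a1 * counts))
    return counts, bool(np.any(lcb1 <= tau))
```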

The remainder of this section is devoted to performance guarantees for Con-LCB. We consider feasible and infeasible instances separately.

Theorem 1

Consider a feasible instance. Under Con-LCB, the expected number of pulls of a feasible but suboptimal arm k (i.e., satisfying \(g_0(\nu (k)) > g_0^*\) and \(g_1(\nu (k)) \le \tau\)), is bounded by

$$\begin{aligned} \mathbb {E}\left[ N_{k}(T)\right] \le \frac{4 \log (2T^2)}{a_0 \varDelta ^2(k)} + 5. \end{aligned}$$

The expected number of pulls of a deceiver arm k (i.e., satisfying \(g_0(\nu (k)) \le g_0^*\) and \(g_1(\nu (k)) > \tau\)) is bounded by

$$\begin{aligned} \mathbb {E}\left[ N_k(T)\right] \le \left( \frac{4 \log (2T^2)}{a_1 [\varDelta _{\tau }(k)]^2}\right) + 2. \end{aligned}$$

The expected number of pulls of an arm k which is an infeasible non-deceiver (i.e., satisfying \(g_0(\nu (k)) > g_0^*\) and \(g_1(\nu (k)) > \tau\)) is bounded by

$$\begin{aligned} \mathbb {E}\left[ N_k(T)\right] \le \min \left( \frac{4 \log (2T^2)}{a_0 \varDelta ^2(k)}, \frac{4 \log (2T^2)}{a_1 [\varDelta _{\tau }(k)]^2} \right) + 5. \end{aligned}$$

The probability of incorrectly setting the feasibility_flag is upper bounded by

$$\begin{aligned} {\textsf{Pr}}\left( \texttt {feasibility\_flag = false} \right) \le \frac{1}{T}. \end{aligned}$$

The main takeaways from Theorem 1 are as follows.

\(\bullet\) The upper bound on the expected number of pulls for feasible, suboptimal arms is logarithmic in the horizon T,  and inversely proportional to the square of the suboptimality gap. This is similar to bounds known for classical (unconstrained) MABs.

\(\bullet\) For deceiver arms, the upper bound on the expected number of pulls, also logarithmic in T, is inversely proportional to the square of the feasibility gap. This is similar to the bound one would obtain in a pure \(g_1\) minimization problem for a hypothetical instance consisting of all the originally infeasible arms, and a single hypothetical (optimal) arm having \(g_1\) equal to \(\tau .\)

\(\bullet\) The bound on the expected number of pulls of non-deceiver, infeasible arms involves a minimum of the dominant terms in the above two cases. Intuitively, this is because these arms can be disambiguated in two ways: via the suboptimality gap, and via the feasibility gap.

\(\bullet\) The probability that the feasibility flag incorrectly identifies the instance as infeasible is upper bounded by a power law in the horizon T. Note that the specific form of the misidentification probability bound is not fundamental; a small modification in the algorithm would make this bound a faster decaying power law, at the expense of a multiplicative bump in regret. However, the fact that this probability is not much smaller (for example, exponentially decaying in the horizon T) is a consequence of an inherent tension between regret minimization and feasibility identification.

A similar tension is known to exist between regret minimization and best arm identification in the unconstrained setting; see Bubeck et al. (2009). We provide lower bounds on the probability that any consistent algorithm makes a mistake in feasibility identification in Sect. 4.

Finally, we note that both the suboptimality regret and the infeasibility regret are logarithmic in the horizon T. Indeed, the suboptimality regret for a feasible instance is bounded as

$$\begin{aligned} R^{\textsf{sub}}_T \le \sum _{k \in \mathcal {K}(\nu ) \setminus \mathcal {K}^*(\nu )} \left( \frac{4 \log (2T^2)}{a_0\varDelta (k)} + 5 \varDelta (k) \right) , \end{aligned}$$

which is similar to regret bounds in the classical unconstrained setting. To express the infeasibility regret compactly, let us interpret \(\varDelta (k)\) as 0 (and \(1/\varDelta (k)\) as \(\infty\)) for deceiver arms. With this notation, the infeasibility regret of a feasible instance is bounded as

$$\begin{aligned} R^{\textsf{inf}}_T \le \sum _{k \in \mathcal {K}(\nu )^c} \left[ 5\varDelta _\tau (k) + \min \left( \frac{4\varDelta _\tau (k) \log (2T^2)}{a_0\varDelta ^2(k)}, \frac{4\log (2T^2)}{a_1 \varDelta _\tau (k)} \right) \right] . \end{aligned}$$

Next, we move on to characterizing the performance of Con-LCB over infeasible instances.

Theorem 2

Consider an infeasible instance. Under Con-LCB, the expected number of pulls of a non-optimal arm k is bounded by

$$\begin{aligned} \mathbb {E}\left[ N_{k}(T)\right] \le \left( \frac{4 \log (2T^2)}{a_1 [\varDelta _\mathrm{{con}}(k)]^2}\right) + K + 2. \end{aligned}$$

Moreover, the probability that the algorithm incorrectly flags the instance as feasible is bounded as \({\textsf{Pr}}\left( \texttt {feasibility\_flag} = \texttt {true} \right) \le \frac{K}{T} \text { for } T>T^*(\nu ),\) where \(T^*(\nu )\) is an instance-dependent constant.

For an infeasible instance, the upper bound on the expected number of pulls of a non-optimal arm, logarithmic in the horizon, and inversely proportional to the square of the constraint gap \(\varDelta _\mathrm{{con}}(k),\) is structurally similar to the bound one would obtain in a pure \(g_1\) minimization problem on the same instance. However, note that when faced with an infeasible instance, Con-LCB would only start playing the optimal arm regularly after all K arms appear to be infeasible; this explains the appearance of K in our bounds.

Here, the constraint regret is bounded as

$$\begin{aligned} R^{\textsf{con}}_T \le \sum _{k \in [K] \setminus \mathcal {K}^*(\nu )} \left[ (K+2) \varDelta _\mathrm{{con}}(k) + \frac{4 \log (2T^2)}{a_1 \varDelta _\mathrm{{con}}(k)} \right] . \end{aligned}$$

Finally, as before, the probability that the feasibility flag wrongly identifies the infeasible instance as feasible decays as a power law in the horizon for \(T > T^*(\nu );\) the threshold \(T^*\) accounts for the time it takes for the algorithm to ‘detect’, with high probability, that the instance is infeasible.

4 Information theoretic lower bounds

In this section, we establish fundamental limits on the performance of algorithms for constrained regret minimization. First, we show that the regret bounds obtained for Con-LCB are asymptotically tight up to universal multiplicative constants (on a class of Gaussian bandit instances). We then prove a lower bound on the probability that any consistent algorithm misidentifies a feasible instance as infeasible or vice versa. This result illustrates an inherent tension between regret minimization and feasibility identification: consistent algorithms (recall that consistency means regret is \(o(T^a)\) for all \(a > 0\)) cannot have a misidentification probability that decays exponentially in the horizon, and algorithms that enjoy a misidentification probability that decays exponentially in the horizon cannot be consistent.

To state our information theoretic lower bound on regret, suppose that the class of arm distributions \(\mathcal {G}\) is the class of 2-dimensional Gaussian distributions with covariance matrix \(\varSigma = \textrm{diag}(\nicefrac {1}{2 a_0}, \nicefrac {1}{2 a_1}).\) Let attribute \(g_0\) be the mean of the first dimension, and \(g_1\) be the mean of the second dimension. For the assumed covariance matrix structure, the standard empirical mean estimators for \(g_0\) and \(g_1\) satisfy the concentration properties stated in (1).

Theorem 3

Let \(\pi\) be any consistent policy over the class of distributions \(\mathcal {G}\) given the threshold \(\tau \in \mathbb {R}.\) Consider a feasible instance \((\nu ^f,\tau ),\) where \(\nu ^f \in \mathcal {G}^K.\) For any feasible but suboptimal arm k

$$\begin{aligned} \liminf _{T \rightarrow \infty } \frac{\mathbb {E}\left[ N^{\pi }_{k}(T)\right] }{\log (T)} \ge \frac{1}{a_0 \varDelta ^2(k)}. \end{aligned}$$

For any deceiver arm k

$$\begin{aligned} \liminf _{T \rightarrow \infty } \frac{\mathbb {E}\left[ N^{\pi }_{k}(T)\right] }{\log (T)} \ge \frac{1}{a_1 [\varDelta _\tau (k)]^2}. \end{aligned}$$

Finally, for any infeasible non-deceiver arm k

$$\begin{aligned} \liminf _{T \rightarrow \infty } \frac{\mathbb {E}\left[ N^{\pi }_{k}(T)\right] }{\log (T)} \ge \frac{1}{2}\min \left( \frac{1}{a_0 \varDelta ^2(k)}, \frac{1}{a_1 [\varDelta _\tau (k)]^2} \right) . \end{aligned}$$

Similarly, for an infeasible instance \((\nu ^i,\tau ),\) such that \(\nu ^i \in \mathcal {G}^K,\) for any non-optimal arm k

$$\begin{aligned} \liminf _{T \rightarrow \infty } \frac{\mathbb {E}\left[ N^{\pi }_{k}(T)\right] }{\log (T)} \ge \frac{1}{a_1 [\varDelta _\text {con}(k)]^2}. \end{aligned}$$

The proof of Theorem 3 can be found in “Appendix B”. Comparing the lower bounds in Theorem 3 with the upper bounds for Con-LCB in Theorems 1 and 2, we conclude that Con-LCB is asymptotically optimal on regret, up to universal multiplicative constants.

Next, we address the fundamental tradeoff between regret minimization and feasibility identification.

Theorem 4

Consider a space of arm distributions \({\mathcal {C}},\) and a threshold \(\tau\) such that \({\mathcal {C}}\) contains both feasible and infeasible arm distributions. There exist a feasible instance \((\nu ,\tau )\) and an infeasible instance \((\nu ^{\prime}, \tau )\) such that for any policy \(\pi\) that is consistent over \({\mathcal {C}},\)

$$\begin{aligned} \limsup _{T \rightarrow \infty } -\frac{1}{T}&\log \Big (\mathbb {P}_{\nu }\left( \texttt {feasibility\_flag} = \texttt {false} \right) \\&\quad + \mathbb {P}_{\nu ^{\prime}}\left( \texttt {feasibility\_flag} = \texttt {true} \right) \Big ) \le 0. \end{aligned}$$

Theorem 4 states that for any consistent algorithm, the probability that \((\nu ,\tau )\) gets misidentified as infeasible, and the probability that \((\nu ^{\prime},\tau )\) gets misidentified as feasible, cannot both decay exponentially in the horizon. This is of course consistent with the power-law probability of misidentification under Con-LCB. In other words, slower-than-exponential decay of the probability of feasibility misidentification with respect to the horizon is an unavoidable consequence of the exploration-exploitation interplay in regret minimization. A similar tension between regret minimization and best arm identification was noted for unconstrained MABs by Bubeck et al. (2009).

5 General framework for constrained MABs

In this section, we provide a general formulation for constrained stochastic MABs with multiple criteria. We allow each arm to be associated with a D-dimensional probability distribution, the goal being to optimize one (dominant) attribute associated with this distribution, subject to constraints on m others. The algorithm design and performance evaluation performed in Sects. 2–4 for the special case of \(m=1\) extend naturally to this general formulation, which can in turn be applied in a variety of application scenarios. We illustrate a few here.

\(\bullet\) For clinical trials, with the arms corresponding to various treatment protocols, the dominant attribute might, for example, correspond to the success/recovery probability, whereas the constraints might capture recovery time, severity of side effects, etc.

\(\bullet\) For product/service rating, where the arms correspond to various service providers, the dominant attribute might correspond to product quality, with constraints capturing reliability, pricing, customer service, etc.

\(\bullet\) In wireless networks, the arms might correspond to various access networks or channels, with, for example, the dominant attribute corresponding to throughput, and constraints capturing delay, energy efficiency, etc.

Formulation Consider a set of K arms, each associated with a D-dimensional probability distribution, with \(\nu (k)\) denoting the joint distribution corresponding to arm \(k \in [K].\) Suppose that \(\nu (k) \in {\mathcal {C}},\) the space of possible arm distributions. The objective and the constraints are defined by functions \(g_0, g_1, \ldots , g_m.\) Specifically, the optimal arm is defined as the arm k that minimizes \(g_0(\nu (k)),\) subject to the constraints \(\{g_i(\nu (k)) \le \tau _i\}_{i=1}^m,\) when the instance is feasible (i.e., at least one arm exists that satisfies the above constraints).

If the instance is infeasible, the optimal arm is defined via a ranked list of the constraints, which orders them by ‘importance’. Without loss of generality, assume that the order of importance increases from \((g_1, \tau _1)\) to \((g_m, \tau _m).\) The idea is to relax the constraints one by one, starting with the least important, until a compliant arm is found. Formally, for a given infeasible instance \(\nu\), for \(2 \le i \le m,\) let \(\mathcal {K}_i(\nu )\) denote the set of arms that satisfy the constraints \(\{g_j(\nu (k)) \le \tau _j\}_{j=i}^m.\) Let us also define \(\mathcal {K}_{m+1}(\nu ):= [K].\) Now, let \(i^*(\nu ) = \min \{k \in [m]:\ \mathcal {K}_{k+1}(\nu ) \ne \emptyset \}.\) Here, \(i^*(\nu )\) is the smallest number of constraints one must relax, in order of increasing importance, to have at least one compliant arm. An optimal arm is then defined to be \({\mathrm{arg\,min}}\{g_{i^*(\nu )}(\nu (k)):\ k \in \mathcal {K}_{i^*(\nu )+1}(\nu )\}.\)
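
As an illustration of this relaxation rule (operating on the true attribute values, and hence a definition rather than a learning algorithm), a minimal sketch, assuming the attributes are supplied as a \(K \times m\) array, is given below:

```python
def optimal_arm_infeasible(g, taus):
    """Sketch of the constraint-relaxation rule for an infeasible
    instance. g[k][i-1] holds g_i(nu(k)) and taus[i-1] holds tau_i,
    with importance increasing in i (as in the text)."""
    K, m = len(g), len(taus)
    for i in range(1, m + 1):  # i plays the role of i*(nu)
        # K_{i+1}(nu): arms satisfying the constraints indexed i+1, ..., m
        compliant = [k for k in range(K)
                     if all(g[k][j] <= taus[j] for j in range(i, m))]
        if compliant:  # relax the i least important constraints
            return min(compliant, key=lambda k: g[k][i - 1])
```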

Similar to the case where \(m=1,\) we make the following assumption to simplify algorithm design. Suppose that for \(0 \le i \le m,\) and \(\overline{\nu } \in {\mathcal {C}},\) there exists an estimator \(\hat{g}_{i,n}(\overline{\nu })\) for \(g_i(\overline{\nu })\) using n i.i.d. samples from \(\overline{\nu },\) satisfying the following concentration inequality: There exists \(a_i > 0\) such that for all \(\varDelta > 0,\)

$$\begin{aligned} {\textsf{Pr}}\left( \left| \hat{g}_{i,n}(\overline{\nu }) - g_i(\overline{\nu }) \right| \ge \varDelta \right) \le 2 \exp (-a_i n \varDelta ^2). \end{aligned}$$
(2)

Note that this is simply an extension of our assumption (1).

Algorithm and performance guarantees For the above problem formulation, we propose Multi-Con-LCB, a simple extension of the Con-LCB algorithm that we presented before for the case of \(m=1.\) Multi-Con-LCB uses lower confidence bounds on the constrained attributes \(\{g_i(\nu (k))\}_{i=1}^{m}\) of each arm k to maintain a set of plausibly feasible arms. The set of plausibly feasible arms for the constraint on function \(g_i\) at time \(t \in [T]\) is denoted by \(\hat{\mathcal {K}}_{i,t}^{\dagger }\) for \(i \in [m].\) Using these per-constraint sets, we construct the set of arms that plausibly lie in \(\mathcal {K}(\nu ),\) denoted by \(\hat{\mathcal {K}}_t\) for time instant \(t \in [T];\) formally, \(\hat{\mathcal {K}}_t = \cap _{i=1}^{m} \hat{\mathcal {K}}_{i,t}^\dagger .\) As this set might be empty, for \(i \in \{2,\ldots ,m\},\) let the estimate for \(\mathcal {K}_i(\nu )\) be \(\hat{\mathcal {K}}_{i,t} = \cap _{j=i}^{m} \hat{\mathcal {K}}_{j,t}^\dagger .\) It is also possible that even the most important constraint is plausibly violated by every arm; we therefore let \(\hat{\mathcal {K}}_{m+1, t} = [K].\) If the set \(\hat{\mathcal {K}}_t\) is not empty, then the algorithm uses lower confidence bounds (LCBs) on \(g_0\) to select the arm to be played. If the set \(\hat{\mathcal {K}}_t\) turns out to be empty, then the algorithm finds the smallest index \(\hat{i}^*\) such that the set \(\hat{\mathcal {K}}_{\hat{i}^*+1,t}\) is not empty, and plays the arm with the lowest LCB on \(g_{\hat{i}^*}.\) Finally, at the end of T rounds, Multi-Con-LCB sets the feasibility flag to true if the set \(\hat{\mathcal {K}}_T\) is not empty, and false otherwise. The details are presented as Algorithm 2.

Algorithm 2: Multi-Con-LCB
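
A minimal sketch of one arm-selection round of Multi-Con-LCB, assuming the LCBs are supplied as a \(K \times (m+1)\) NumPy array (column 0 for \(g_0\), column i for \(g_i\)), is given below; as before, Algorithm 2 remains the formal reference.

```python
import numpy as np

def multi_con_lcb_step(lcb, taus):
    """One arm-selection round of Multi-Con-LCB (sketch).
    lcb: K x (m+1) array of lower confidence bounds; taus: the m
    thresholds, with importance increasing in the index."""
    m = lcb.shape[1] - 1
    # K_hat_{i,t}^dagger as boolean columns: per-constraint plausible feasibility
    feas = lcb[:, 1:] <= np.asarray(taus)
    k_hat = np.flatnonzero(feas.all(axis=1))  # K_hat_t: all m constraints
    if k_hat.size > 0:
        return int(k_hat[np.argmin(lcb[k_hat, 0])])  # lowest LCB on g_0
    for i in range(1, m + 1):  # smallest viable relaxation index i_hat^*
        # K_hat_{i+1,t}: arms plausibly satisfying constraints i+1, ..., m
        k_i = np.flatnonzero(feas[:, i:].all(axis=1))
        if k_i.size > 0:
            return int(k_i[np.argmin(lcb[k_i, i])])  # lowest LCB on g_{i_hat^*}
```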

The remainder of this section is devoted to performance guarantees for Multi-Con-LCB. The suboptimality gap for an arm k is given by \(\varDelta (k) = \max (g_0(\nu (k)) - g_0^*, 0).\) The infeasibility gap of an arm k for constraint i is given by \(\varDelta _{i, \tau _i}(k) = \max (g_i(\nu (k))-\tau _i, 0).\) We restrict our attention to feasible instances here; infeasible instances can be handled along the same lines as in Theorem 2 for Con-LCB.

Theorem 5

Consider a feasible instance. Under Multi-Con-LCB, the expected number of pulls of a feasible but suboptimal arm k (i.e., satisfying \(g_0(\nu (k)) > g_0^*\) and \(g_i(\nu (k)) \le \tau _i\) for all \(i \in [m]\)), is bounded by

$$\begin{aligned} \mathbb {E}\left[ N_{k}(T)\right] \le \frac{4 \log (2T^2)}{a_0 \varDelta ^2(k)} + 2m + 3. \end{aligned}$$

The expected number of pulls of a deceiver arm k (i.e., satisfying \(g_0(\nu (k)) \le g_0^*\) and there exists a constraint indexed by \(j \in [m]\) such that \(g_j(\nu (k)) > \tau _j\)) is bounded by

$$\begin{aligned} \mathbb {E}\left[ N_k(T)\right] \le \min _{i \in [m]} \left( \frac{4 \log (2T^2)}{a_i [\varDelta _{i, \tau _i}(k)]^2}\right) + m. \end{aligned}$$

The expected number of pulls of an arm k which is infeasible, but not a deceiver (i.e., satisfying \(g_0(\nu (k)) > g_0^*\) and there exists a constraint indexed by \(j \in [m]\) such that \(g_j(\nu (k)) > \tau _j\)) is bounded by

$$\begin{aligned} \mathbb {E}\left[ N_k(T)\right] \le \min \left( \frac{4 \log (2T^2)}{a_0 \varDelta ^2(k)}, \min _{i \in [m]} \left( \frac{4 \log (2T^2)}{a_i [\varDelta _{i, \tau _i}(k)]^2}\right) \right) + 2m + 3. \end{aligned}$$

The probability of incorrectly setting the feasibility_flag is upper bounded by

$$\begin{aligned} {\textsf{Pr}}\left( \texttt {feasibility\_flag = false} \right) \le \frac{m}{T}. \end{aligned}$$

We omit the proof of Theorem 5, as it is very similar to the proof of Theorem 1. However, we state the key takeaways from Theorem 5 below.

\(\bullet\) Similar to Theorem 1, the upper bound on the expected number of pulls of suboptimal arms is logarithmic in T and is inversely proportional to the square of the suboptimality gap. Moreover, the upper bound for a deceiver arm is also logarithmic in T, but there is a minimum over m terms, because the deceiver arm can be identified via any of the constraints that the arm does not satisfy. Similarly, the upper bound for a non-deceiver, infeasible arm is also logarithmic in T, with a minimum over \(m+1\) terms, because the arm can be identified as suboptimal, or as infeasible for violating one of the m constraints.

\(\bullet\) With an increase in the number of constraints to m, the upper bounds on the expected number of pulls of non-optimal arms increase linearly in m, and the probability of misidentification also gets scaled by a factor of m. This slight degradation in the guarantees is expected: with more constraints, the optimal arm has to satisfy more conditions, and it becomes harder to compare different arms.

6 Numerical experiments

In this section, we present simple experiments to illustrate the numerical performance of our algorithm Con-LCB.

We consider an instance with four arms and two criteria. Each arm is associated with a (two-dimensional) multivariate Gaussian distribution, different arms having the same covariance matrix \(\varSigma ,\) but different mean vectors. The means corresponding to the first dimension are 0.3, 0.4, 0.2, and 0.5, and the means corresponding to the second dimension are 0.4, 0.4, 1.0, and 1.0. The covariance matrix is given by \(\varSigma = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}\).

The goal is to maximize the pulls of the arm with the minimum mean in the first dimension, subject to a constraint on the mean of the second. Specifically, we require that the mean of the second dimension be less than or equal to \(\tau := 0.5.\) The instance is summarized in Table 1. We use empirical averages as the estimators for the means, and standard confidence bounds based on a sub-Gaussianity assumption. We run the algorithm 1000 times for values of T in [10000, 100000]. The suboptimality regret \(R_{T}^{\text {sub}}\) and the infeasibility regret \(R_{T}^{\text {inf}}\) are plotted in Figure 2. Clearly, the regrets grow sub-linearly with respect to the horizon. Also, in our experiments, the algorithm correctly detected the feasibility of the instance in all of the 1000 runs.
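
For reference, this instance can be reproduced with the con_lcb sketch from Sect. 3 (the seed and horizon below are illustrative; the defaults \(a_0 = a_1 = 0.5\) match the unit per-dimension variances):

```python
import numpy as np

rng = np.random.default_rng(0)
cov = [[1.0, 0.5], [0.5, 1.0]]
mu = [(0.3, 0.4), (0.4, 0.4), (0.2, 1.0), (0.5, 1.0)]
arms = [lambda m=m: rng.multivariate_normal(m, cov) for m in mu]

counts, flag = con_lcb(arms, tau=0.5, T=10_000)  # sketch from Sect. 3
print(counts, flag)  # expect most pulls on arm 1 (index 0), flag == True
```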

Table 1 Gaussian instance (\(\tau = 0.5\)):
Arm 1: means (0.3, 0.4); feasible, optimal
Arm 2: means (0.4, 0.4); feasible, suboptimal
Arm 3: means (0.2, 1.0); infeasible, deceiver
Arm 4: means (0.5, 1.0); infeasible, non-deceiver
Fig. 2: Regret of Gaussian instance

We consider another instance where the arms are Beta distributed. The goal here is to maximize the pulls of the arm with the minimum mean, subject to a constraint on the variance. Specifically, we require that the variance of the arms be less than or equal to \(\tau := 0.1.\) The instance is summarized in Table 2. We use the empirical average to estimate the mean of each arm, and the standard unbiased estimator of variance to estimate its variance. The concentration bounds we use are based on the fact that the underlying distributions are bounded between 0 and 1. We run the algorithm 100 times for values of T in [10000, 100000]. The suboptimality regret \(R_{T}^{\text {sub}}\) and the infeasibility regret \(R_{T}^{\text {inf}}\) are plotted in Figure 3. Similar to the previous case, the algorithm correctly detected feasibility in all of the 100 runs.

We also compare Con-LCB with an algorithm designed to optimize a single ‘Lagrange relaxed’ objective of the form \(g_0 + \beta g_1.\) Since there is no systematic way of setting \(\beta\) such that the original (constrained) optimal arm also minimizes this relaxed metric, this approach is very fragile, as we demonstrate in Figure 4. The instance we use is the Beta distributed instance in Table 2, and we plot the sum of the infeasibility regret and the suboptimality regret. When \(\beta\) is small (\(\beta <10/7\) here), a deceiver arm is optimal for the relaxed objective, and when \(\beta\) is large (\(\beta >5\) here), a suboptimal arm is optimal for the relaxed objective; the ‘correct’ range of \(\beta\) is instance dependent (and a priori unknown).

Table 2 Beta instance
Fig. 3: Regret of Beta distributed instance

Fig. 4: Comparison with the Lagrangian relaxation

7 Concluding remarks

In this paper, we have introduced a general formulation of multi-criterion constrained MABs, which is applicable when there are multiple attributes of interest associated with each arm. We propose algorithms that incur logarithmic regret, and also provide information theoretic lower bounds on the performance of any algorithm. An interesting departure from the classical MAB formulation is the aspect of instance feasibility; our algorithms predict, post facto, with a probability of error that decays as a power law in the horizon, whether the instance presented was feasible. Interestingly, this illustrates a fundamental tradeoff between regret minimization and feasibility detection. Finally, our algorithms ‘auto-tune’ to an infeasible instance, relaxing the constraints on arm attributes (in order of increasing importance), until a compliant arm is found.

The proposed framework and our algorithms can be applied in a wide range of application scenarios [see Bouneffouf and Rish (2019) for a survey]. Further, this work motivates the study of multi-criterion variants of other bandit formulations, including contextual bandits, combinatorial bandits, and correlated bandits.