1 Introduction

Recommender systems are a class of online reinforcement learning algorithms that interact sequentially with an environment, suggesting relevant items to a user. During the last decade, there has been an increasing interest in online recommendation techniques due to the importance of advertisement recommendation for e-commerce websites or the rise of movie and music streaming platforms (Gomez-Uribe and Hunt 2016; Dragone et al. 2019). Among different settings for recommender systems, in this work, we focus on the contextual bandit framework applied to the recommendation of quantum data. The contextual bandit problem is a variant of the multi-armed bandit problem where a learner at each round receives a context and, given a set of actions (also called arms), has to decide the best action using the context information. After selecting an action, the learner receives a reward, and in subsequent rounds, they use the previous information of contexts and rewards in order to make their future choices. As in the classical multi-armed bandit problem, the learner has to find a balance between exploration and exploitation; exploration refers to trying different actions in order to eventually learn the ones with the highest reward, and exploitation refers to selecting the actions that apparently will give the highest reward immediately. For a comprehensive review of bandit algorithms, we refer to the book by Lattimore and Szepesvári (2020). Some real-life applications (Bouneffouf et al. 2020) of bandit algorithms include clinical trials (Durand et al. 2018), dynamic pricing (Cohen et al. 2020), advertisement recommendation (Tang et al. 2013), or online recommender systems (Li et al. 2010; McInerney et al. 2018). As an example, in Li et al. (2010), a news article recommender system was considered where the context is the user features, the actions are the articles to recommend, and the reward is modeled as a binary outcome indicating whether or not the user clicks on the recommended article.

Fig. 1

Sketch of a recommender system for quantum data. The learner sequentially receives quantum contexts and feeds them to the classical processing system. The context is also fed to the measurement system. The classical processing system uses the information about the context to pick one of the quantum processes (no information about these processes is known apart from measurement outcomes). The chosen quantum process is applied to the measurement system, and the measurement outcome is fed to the classical processing system and is added to the cumulative reward

Quantum algorithms for the classical multi-armed bandit problem have been studied for the settings of best-arm identification (Casalé et al. 2020; Wang et al. 2021), exploration-exploitation with stochastic environments (Wan et al. 2022) (uncorrelated and linearly correlated actions), and adversarial environments (Cho et al. 2022). Also, a quantum neural network approach was considered in Hu et al. (2019) for a simple best-arm identification problem. A quantum algorithm for a classical recommender system was considered in Kerenidis and Prakash (2016), claiming an exponential speedup over known classical algorithms, but later Tang (2019) proved that the speedup relies on the assumptions made in the quantum state preparation part and argued that, under related classical assumptions, a classical algorithm can also achieve the speedup. There are other, more general reinforcement learning frameworks beyond bandits where actions affect the rewards in the long term, such as Markov decision processes. The quantum generalization of this framework has been considered in Barry et al. (2014); Ying et al. (2021), and although our model of study falls into their class, we can derive more concrete results since we study a specific setting.

We are interested in studying a recommender system for quantum data that is modeled by a set of unknown quantum processes—which is called the environment, and a set of tasks to perform using these quantum processes—which is called a context set (Fig. 1). A learner interacts sequentially with the environment, receiving at each round a task from the context set and then choosing the best quantum process to perform this task. For example, we could model the environment as a set of noisy quantum computers and the context set as a set of different quantum algorithms; then, at each round, the learner is given a quantum algorithm to run, and their goal is to recommend the best quantum computer for this task. We note that this model exemplifies the bandit exploration-exploitation trade-off since the learner has to try (explore) the different quantum computers in order to decide the best one but at the same time has to choose the best one (exploitation) to perform the task. This trade-off is interesting in a practical scenario because it captures settings where online decisions are important and/or have some associated cost that makes the learner always try to perform optimally. In our example, one could think that using a quantum computer costs money for the learner, so at each stage, they always want to select the ones that will output the best solutions.

In our work, we extend the setting considered in Lumbreras et al. (2022), where the exploration-exploitation trade-off of learning properties of quantum states was studied. In our model, the environment is a set of unknown quantum states, the context set is a (finite or infinite) set of observables, and at each round, the learner receives an observable and has to perform a measurement on one of the unknown quantum states (the recommendation) aiming to maximize its outcome. We define this problem as the quantum contextual bandit (QCB), and we note that it falls into the class of linear contextual bandits (Abe et al. 2003; Auer 2003; Chu et al. 2011). The QCB is the basic framework where we formulate our recommender system for quantum data. We use as a figure of merit the regret, which is the cumulative sum of the difference between the expected outcome of the best and selected action at each round. We note that although our model is a subset of the problem of linear contextual bandits, it still suffers the same regret scaling in a worst-case scenario (see Sect. 3). The main goal of this work is to show how the bandit framework can be applied to recommender systems of quantum data rather than focusing on finding particular settings where the regret has a better scaling. Finding a strategy that minimizes regret implies finding the mentioned balance between exploration and exploitation of the different actions. As a concrete recommendation task captured by the QCB model, we consider the low energy quantum state recommendation problem. In this problem, at each round, the learner receives a quantum Hamiltonian and has to recommend from the environment the state with the lowest energy. The ground state preparation problem is an important ingredient of NISQ algorithms (Bharti et al. 2022), and our model could be useful in order to implement an online recommendation algorithm that helps the learner choose the best ansatz for their energy minimization task when they have multiple problems to solve. One of the advantages of using bandit algorithms for this task is that they do not need to reconstruct the whole d-dimensional state, just the part relevant for the recommendation, which depends on the structure of the context set. In order to do that, we combine a Gram-Schmidt procedure with classical linear bandit strategies. This allows our algorithm to store low-dimensional approximations of the unknown quantum states without prior knowledge of the context set. We remark that the goal of our problem is slightly different from the usual ground state finding problem since we focus on the recommendation part, where we just want to recommend a “good” enough state.

We also perform some numerical studies of the scaling of the regret for the cases where the context set is given by an Ising model and by a generalized cluster model studied in Verresen et al. (2017). For these models, we choose unknown actions for the algorithm corresponding to ground states located in different phases, and then, by recording which action is chosen for each context received by the algorithm, we reproduce a ground state phase diagram. We observe that the algorithm effectively makes its recommendations by approximately classifying the different phases of the studied models, and we are able to clearly distinguish them in the phase diagram.

The rest of the paper is organized as follows: in Sect. 2, we establish the mathematical model for the quantum contextual bandit and then define the notation used throughout the paper; in the next section, Sect. 3, we prove a lower bound on a performance metric (the expected regret, which we define in Sect. 2) over all possible algorithms. In Sect. 4, we review the linear upper confidence bound (LinUCB) algorithm. In Sect. 5, we describe the low-energy recommendation system and adapt the LinUCB algorithm to this setting. We illustrate the efficiency of the algorithm through simulations of different context sets.

2 The model

First, we introduce some notation in order to define our model and present our results. We define \([T] = \{ 1,...,T \} \) for \(T\in \mathbb {N}\). Let \(\mathcal {S}_d = \lbrace \rho \in \mathbb {C}^{d \times d}: \rho \ge 0 \wedge {{\,\textrm{Tr}\,}}( \rho ) = 1 \rbrace \) denote the set of positive semi-definite operators with unit trace, i.e., quantum states that act on a d-dimensional Hilbert space \(\mathbb {C}^d\). Moreover, observables are Hermitian operators acting on \(\mathbb {C}^d\), collected in the set \(\mathcal {O}_d = \lbrace O\in \mathbb {C}^{d\times d}: O^{\dagger } = O \rbrace \). We denote real d-dimensional column vectors as \(\textbf{v}\) and the inner product of two of them \(\textbf{u},\textbf{v}\in \mathbb {R}^d\) as \(\textbf{u}^\top \textbf{v}\), where \(\textbf{u}^\top \) denotes the transpose of \(\textbf{u}\), which is a row vector. We use \(\Vert \cdot \Vert _2\) in order to denote the 2-norm of a real vector. For an n-qubit system with Hilbert space dimension \(d=2^n\), we denote by \(X_i,Y_i\), and \(Z_i\) the x, y, and z Pauli operators acting on the i-th qubit (\(1\le i \le n\)). A Pauli observable can be expressed as an n-fold tensor product of the \(2 \times 2\) Pauli matrices, i.e., it is an element of the set \(\left\{ I,X,Y,Z \right\} ^{\otimes n} \setminus \lbrace I^{\otimes n} \rbrace \). Note that there are \(4^n - 1\) such observables; they are mutually orthogonal with respect to the Hilbert-Schmidt inner product and, together with the identity, form a basis of the space of Hermitian operators, which we refer to as the Pauli basis.
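To make the notation concrete, the following minimal Python sketch (our own illustration, not part of the formal development) enumerates the \(4^n-1\) non-identity Pauli observables for a small n and checks their mutual orthogonality with respect to the Hilbert-Schmidt inner product.

```python
# Enumerate the 4^n - 1 non-identity n-qubit Pauli observables and check
# Hilbert-Schmidt orthogonality, Tr(P Q) = 0 for distinct Paulis P != Q.
import itertools
import numpy as np

PAULIS = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def pauli_observables(n):
    """Return the 4^n - 1 non-identity Pauli observables on n qubits."""
    obs = []
    for labels in itertools.product("IXYZ", repeat=n):
        if set(labels) == {"I"}:          # skip the identity I^{\otimes n}
            continue
        P = np.array([[1.0 + 0j]])
        for l in labels:
            P = np.kron(P, PAULIS[l])
        obs.append(("".join(labels), P))
    return obs

if __name__ == "__main__":
    n = 2
    obs = pauli_observables(n)
    print(len(obs))                        # 4^2 - 1 = 15
    (_, P), (_, Q) = obs[0], obs[1]
    print(np.isclose(np.trace(P @ Q), 0.0))  # distinct Paulis are orthogonal
```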

The definition of our model takes some of the conventions used for the multi-armed quantum bandit (MAQB) problem (Lumbreras et al. 2022).

Definition 1

(Quantum contextual bandit) Let \(d\in \mathbb {N}\). A d-dimensional quantum contextual bandit is given by a set of observables \(\mathcal {C} = \lbrace O_c \rbrace _{c\in \Omega _{\mathcal {C}}}\subseteq \mathcal {O}_d\) that we call the context set, where \((\Omega _{\mathcal {C}},\Sigma _{\mathcal {C}})\) is a measurable space and \(\Sigma _{\mathcal {C}}\) is a \(\sigma \)-algebra of subsets of \(\Omega _{\mathcal {C}}\). The bandit is in an environment, a finite set of quantum states \(\gamma = \lbrace \rho _1,\rho _2,\cdots , \rho _k \rbrace \subset \mathcal {S}_d \), which is unknown. The quantum contextual bandit problem is characterized by the tuple \((\mathcal {C},\gamma )\).

Given the environment \(\gamma \) such that \( |\gamma | = k\), we define the action set \(\mathcal {A} = \lbrace 1,...,k \rbrace \) as the set of indices that label the quantum states \(\rho _i\in \gamma \) in the environment. For every observable \(O_c \in \mathcal {C}\), the spectral decomposition is given by

$$\begin{aligned} O_c = \sum _{i=1}^{d_c} \lambda _{c,i}\Pi _{c,i}, \end{aligned}$$
(1)

where \(\lambda _{c,i} \in \mathbb {R}\) denote the \(d_c \le d\) distinct eigenvalues of \(O_c\) and \(\Pi _{c,i}\) are the orthogonal projectors on the respective eigenspaces. For each action \(a\in \mathcal {A}\), we define the reward distribution with outcome \(R \in \mathbb {R}\) as the conditional probability distribution associated with performing a measurement using \(O_c\) on \(\rho _a\) given by Born’s rule

$$\begin{aligned}&Pr\left[ R = r | A = a , O= O_c \right] \nonumber \\&\quad = P_{\rho _a}(r|a,c) = \left\{ \begin{array}{ll} {{\,\textrm{Tr}\,}}( \rho _a \Pi _{c,i} ) \text { if } r= \lambda _{c,i}, \\ 0 \text { else. } \end{array}\right. \end{aligned}$$
(2)

With the above definitions, we can explain the learning process. The learner interacts sequentially with the QCB over T rounds such that for every round \(t\in [T]\):

  1. The learner receives a context \(O_{c_t} \in \mathcal {C}\) drawn from some (possibly unknown) probability measure \(P_\mathcal {C}:\Sigma _{\mathcal {C}} \rightarrow [0,1]\) over the set \(\Omega _{\mathcal {C}}\).

  2. Using the previous information of received contexts, actions played, and observed rewards, the learner chooses an action \(A_t\in \mathcal {A}\).

  3. The learner uses the context \(O_{c_t}\) to perform a measurement on the unknown quantum state \(\rho _{A_t}\) and receives a reward \(R_t\) sampled according to the probability distribution (2).

We use the index \(c_t \in \Omega _{\mathcal {C}}\) to denote the observable \(O_{c_t}\) received at round \(t\in [T]\). The strategy of the learner is given by a set of (conditional) probability distributions \(\pi = \lbrace \pi _t \rbrace _{t\in \mathbb {N}}\) (policy) on the action index set [k] of the form

$$\begin{aligned} \pi _t (a_t|a_1,r_1,c_1,...,a_{t-1},r_{t-1},c_{t-1}, c_t ), \end{aligned}$$
(3)

defined for all valid combinations of actions, rewards, and contexts \((a_1,r_1,c_1,...,a_{t-1},r_{t-1},c_{t-1} )\) up to time \(t-1\). Then, if we run the policy \(\pi \) on the environment \(\gamma \) over \(T\in \mathbb {N}\) rounds, we can define a joint probability distribution over the set of actions, rewards, and contexts as

$$\begin{aligned} P_{\gamma ,\mathcal {C},\pi } (a_1,dr_1,dc_1,...,a_T,dr_T,dc_T) = \prod _{t=1}^T \pi _t (a_t|a_1,r_1,c_1,...,a_{t-1},r_{t-1},c_{t-1}, c_t )\, P_{\mathcal {C}} (dc_t)\, P_{\rho _{a_t}} (d r_t |a_t,c_t ). \end{aligned}$$
(4)

Thus, the conditional expected value of the reward \(R_t\) is given by

$$\begin{aligned} {\,\mathbb {E}\,}_{\gamma ,\mathcal {C},\pi } [ R_{t} | A_t = a , O_{c_t} = O_c ] = {{\,\textrm{Tr}\,}}( \rho _a O_c ), \end{aligned}$$
(5)

where \({{\,\mathrm{\mathbb {E}}\,}}_{\gamma ,\mathcal {C},\pi } \) denotes the expectation value over the probability distribution in Eq. 4. The goal of the learner is to maximize its expected cumulative reward \(\sum _{t=1}^T {{\,{\mathbb {E}}\,}}_{\gamma ,\mathcal {C},\pi } \left[ R_t \right] \) or, equivalently, to minimize the cumulative expected regret

$$\begin{aligned} \text {Regret}_T^{\mathcal {\gamma ,C},\pi } = \sum _{t=1}^T {{\,{\mathbb {E}}\,}}_{\gamma ,\mathcal {C},\pi }\left[ \max _{\rho _i\in \gamma }{{\,\textrm{Tr}\,}}(\rho _iO_{c_t}) - R_t \right] . \end{aligned}$$
(6)

For a given action \(a\in \mathcal {A}\) and context \(O_c \in \mathcal {C}\), the sub-optimality gap is defined as

$$\begin{aligned} \Delta _{a,O_c}=\max _{i\in \mathcal {A}}{{\,\textrm{Tr}\,}}(\rho _{i}O_c) - {{\,\textrm{Tr}\,}}(\rho _{a}O_c). \end{aligned}$$
(7)

Note that the learner could try to learn the distribution of contexts \(P_\mathcal {C}\); however, this will not make a difference in minimizing the regret. The strategy of the learner has to be able to learn the relevant part of the unknown states \(\lbrace \rho _a \rbrace _{a=1}^{k}\) that depends on the context set and at the same time balance the tradeoff between exploration and exploitation. We note that it is straightforward to generalize the above setting to continuous context sets \(\mathcal {C}\). In order to do that, we need a well-defined probability measure \(P_\mathcal {C} (O) dO\) over the context set \(\mathcal {C}\).
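As an illustration of the interaction protocol above, the following minimal Python sketch simulates a single round of a QCB. The toy environment, the toy context set, and the placeholder uniformly random policy are arbitrary assumptions made only for this example.

```python
# Simulate one QCB round: receive a context, pick an action, sample a reward
# from Born's rule (Eq. 2), and record the expected sub-optimality gap (Eq. 7).
import numpy as np

rng = np.random.default_rng(0)

def random_state(d):
    """A random mixed state, only used to build a toy environment."""
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = A @ A.conj().T
    return rho / np.trace(rho)

def born_sample(rho, O):
    """Sample a measurement outcome of observable O on state rho."""
    evals, evecs = np.linalg.eigh(O)
    probs = np.real([evecs[:, i].conj() @ rho @ evecs[:, i] for i in range(len(evals))])
    probs = np.clip(probs, 0, None); probs /= probs.sum()
    return rng.choice(evals, p=probs)

d, k = 4, 3
environment = [random_state(d) for _ in range(k)]           # unknown states rho_1..rho_k
contexts = [np.diag(rng.normal(size=d)) for _ in range(5)]  # toy context set of observables

O_t = contexts[rng.integers(len(contexts))]                 # 1. receive a context
a_t = rng.integers(k)                                       # 2. (placeholder) policy picks an action
r_t = born_sample(environment[a_t], O_t)                    # 3. measure and receive the reward
expected = [np.real(np.trace(r @ O_t)) for r in environment]
regret_increment = max(expected) - expected[a_t]            # expected sub-optimality gap
print(r_t, regret_increment)
```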

3 Lower bound

In this section, we derive a lower bound for the cumulative expected regret by finding, for any strategy, a QCB that is hard to learn. Our regret lower bound proof for the QCB model relies on a reduction to a classical multi-armed stochastic bandit given in Theorem 5.1 in Auer et al. (2003). Now, we briefly review the multi-armed stochastic bandit problem.

The multi-armed stochastic bandit problem is defined by a discrete set of probability distributions \(\nu = (P_a: a\in [k] )\) that is called the environment, and \(\mu _i\) is the mean of the probability distribution \(P_i\) for \(i\in [k]\). The learner interacts sequentially with the bandit, selecting at each round \(t\in [T]\) an action \(a\in [k]\) and sampling a reward \(R_t\) distributed according to \(P_a\). The expected cumulative regret is defined as

$$\begin{aligned} \text {Regret}_T^{\nu ,\pi } = \sum _{t=1}^T \max _{a\in [k]} \mu _a - {{\,{\mathbb {E}}\,}}_{\nu ,\pi } [R_t] , \end{aligned}$$
(8)

where \(\pi \) and \({{\,{\mathbb {E}}\,}}_{\nu ,\pi }\) are both defined analogously to the definitions of the previous section for this model. It is important to remark that in this setting, the actions are independent, meaning that when the learner samples from one action, it cannot use this information to learn about other actions.

Using the above model, we describe the multi-armed stochastic bandit studied in Theorem 5.1 in Auer et al. (2003). The bandit is constructed defining an environment \(\nu = (P_a: a\in [k] )\) for \(k\ge 2\) such that \(P_a\) are Bernoulli distributions for all \(a\in [k]\) with outcomes \(\lbrace l_1,l_2 \rbrace \). Then, we set the distributions as follows: we choose an index \(i\in [k]\) uniformly at random and assign \(P_i (R=l_1) = \frac{1+\Delta }{2}\) for some \(\Delta >0\) and \(P_a (R = l_1 ) = \frac{1}{2}\) for \(a\ne i\). Thus, there is a unique best action corresponding to \(a=i\). Then, choosing \(\Delta = \epsilon \sqrt{\frac{k}{T}}\) for some small positive constant \(\epsilon \), for \(T\ge k\), the expected regret for any strategy will scale as

$$\begin{aligned} \text {Regret}_T^{\nu ,\pi } = \Omega (\sqrt{kT} ). \end{aligned}$$
(9)

Theorem 2

Consider a quantum contextual bandit with underlying dimension \(d = 2^n\) and \(n\in \mathbb {N}\), context size \(c\ge 1\) and \(k\ge 2\) actions. Then, for any strategy \(\pi \), there exists a context set \(\mathcal {C}\), \(|\mathcal {C}|=c\), a probability distribution \(P_\mathcal {C}\) over the context set \(\mathcal {C}\), and an environment \(\gamma \subset \mathcal {S}_d\) such that for the QCB defined by \((\mathcal {C},\gamma )\), the expected cumulative regret will scale as

$$\begin{aligned} Regret _T^{\gamma , \mathcal {C},\pi } = \Omega \left( \sqrt{kT}\cdot \min \left\{ d,\sqrt{c} \right\} \right) , \end{aligned}$$
(10)

for \(T\ge k\min \lbrace c,d^2 \rbrace \).

Proof

We use a similar technique to Abe et al. (2003); Chu et al. (2011) in order to analyze the regret by dividing the problem into subsets of independent rounds. We start by dividing the T rounds into \(c' = \min \lbrace c,d^2-1 \rbrace \) groups of \(T' = \lfloor \frac{T}{c'} \rfloor \) elements. We say that time step t belongs to group s if \(\lfloor \frac{t}{T'} \rfloor = s\). We construct a context set \(\mathcal {C}\) by picking a set of \(c'\) distinct Pauli observables (which is possible since the maximum number of independent Pauli observables is \(d^2-1 \ge c'\)), so \( \mathcal {C} = \lbrace \sigma _i \rbrace _{i=1}^{c'}\). Recall that a Pauli observable is an n-fold tensor product of the 2\(\times \)2 Pauli matrices; thus, the reward will be a binary outcome \(r_t\in \lbrace -1,1\rbrace \). Then, the context distribution works as follows: in group s, the learner only receives the context \(\sigma _s\in \mathcal {C}\).

We want to build an environment such that for each group of rounds \(s\in [c' ]\), all probability distributions are uniform except one that is slightly perturbed. We associate each Pauli observable \(\sigma _i\) to one unique action \(a\in [k]\), and we do this association uniformly at random (each action can be associated with more than one Pauli observable). Then, each action \(a\in [k]\) will have \(\lbrace \sigma _{a,1},...,\sigma _{a,n_a} \rbrace \) associated Pauli observables, and we can construct the following environment \(\gamma = \lbrace \rho _a \rbrace _{a=1}^{k}\) where

$$\begin{aligned} \rho _a = \frac{I}{d} + \sum _{j=1}^{n_a} \frac{\Delta }{d}\sigma _{a,j}, \end{aligned}$$
(11)

\(n_a \in \left\{ 0,1,...,d^2 -1 \right\} \), \(\sum _{a=1}^k n_a = c'\) and \(\Delta \) is some positive constant. For every group \(s\in [c' ]\), the learner will receive a fixed context \(\sigma _s \in \mathcal {C}\), and there will be a unique action \(a'\) with \(P_{\rho _{a'}}(1 | A_t =a',s) =\frac{1}{2}+\frac{\Delta }{2}\) (probability of obtaining \(+1\)), and the rest \(a\ne a'\) will have \(P_{\rho _a}(1 | a,s) = \frac{1}{2}\) (uniform distributions). Thus, using that the contexts are independent (\({{\,\textrm{Tr}\,}}(\sigma _i\sigma _{j})=0\) for \(i\ne j\)), we can apply (9) independently to every group s, and we obtain a regret lower bound \( \Omega ( \sqrt{T'k} ) =\Omega ( \sqrt{\frac{Tk}{c'}} ) \). Note that in order to apply (9), we need \(T' \ge k\) or equivalently \(T\ge c'k\). Thus, summing over all the \(c' \) groups, we obtain that the total regret scales as

$$\begin{aligned} \text {Regret}_T^{\gamma , \mathcal {C},\pi } = \Omega \left( c' \sqrt{\frac{Tk}{c'}} \right) = \Omega \left( \sqrt{kT}\cdot \min \left\{ d,\sqrt{c} \right\} \right) . \end{aligned}$$
(12)

\(\square \)
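As a numerical sanity check of the construction used in the proof (not part of the argument itself), the following sketch builds a small instance of the environment in Eq. 11 and verifies the outcome probabilities. The chosen Paulis, the assignment of Paulis to actions, and the value of \(\Delta \) are arbitrary.

```python
# Check that rho_a = I/d + (Delta/d) * sum_j sigma_{a,j} yields outcome +1 with
# probability 1/2 + Delta/2 for its assigned Paulis, and 1/2 otherwise.
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

d, Delta = 4, 0.1
# A context set of c' = 3 distinct two-qubit Pauli observables.
paulis = [np.kron(X, np.eye(2)), np.kron(np.eye(2), Z), np.kron(Z, Z)]
assignment = [0, 1, 1]            # Pauli j is assigned to action assignment[j]
k = 2

states = []
for a in range(k):
    rho = np.eye(d) / d
    for j, owner in enumerate(assignment):
        if owner == a:
            rho = rho + (Delta / d) * paulis[j]
    states.append(rho)

for j, sigma in enumerate(paulis):
    for a in range(k):
        p_plus = (1 + np.real(np.trace(states[a] @ sigma))) / 2
        print(j, a, round(p_plus, 3))   # 0.55 when a == assignment[j], else 0.5
```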

4 Algorithm

In this section, we review the linear model of multi-armed stochastic bandits and one of the main classical strategies that can be used to minimize regret in this model and also in the QCB model.

4.1 Linear disjoint single context bandits and QCB

The classical setting that matches our problem is commonly referred to as linear contextual bandits (Chu et al. 2011), although it has received other names depending on the specific setting, such as the linear disjoint model (Li et al. 2010) or associative bandits (Auer 2003). The setting that we are interested in uses discrete action sets, and optimal algorithms are based on upper confidence bounds (UCB). While these algorithms use the “principle of optimism in the face of uncertainty,” there are other approaches such as Thompson sampling (Agrawal and Goyal 2013), but they are not optimal for discrete action sets. We use the contextual linear disjoint bandit model from Li et al. (2010) where each action \(a\in [k]\) has an associated unknown parameter \(\varvec{\theta }_a\in \mathbb {R}^d\) and at each round t the learner receives a context vector \(\textbf{c}_{t,a}\in \mathbb {R}^d\) for each action. Then, after selecting an action \(a\in [k]\), the sampled reward is

$$\begin{aligned} R_t = \varvec{\theta }^\top _a \textbf{c}_{t,a} + \eta _t, \end{aligned}$$
(13)

where \(\eta _t\) is some bounded subgaussian noise such that \(\mathbb {E} [R_{t} |A_t = a] = \varvec{\theta }^\top _a \textbf{c}_{t,a}\). Recall that a random variable X is called \(\sigma \)-subgaussian if for all \(\mu \in \mathbb {R}\) we have \(\mathbb {E} \left[ \exp ( \mu X ) \right] \le \exp \left( \sigma ^2 \mu ^2 / 2 \right) \).

In order to map the above setting to the d-dimensional QCB model \((\gamma ,\mathcal {C})\), it suffices to consider a vector parametrization (similarly to what is done for the MAQB (Lumbreras et al. 2022)). We choose a set \(\lbrace \sigma _i \rbrace _{i=1}^{d^2}\) of Hermitian matrices that are orthonormal with respect to the Hilbert-Schmidt inner product and parametrize any \(\rho _a \in \gamma \) and \(O_l \in \mathcal {C}\) as

$$\begin{aligned} \rho _a = \sum _{i = 1}^{d^2}\theta _{a,i} \sigma _i , \quad O_l = \sum _{i=1}^{d^2} c_{l,i} \sigma _i , \end{aligned}$$
(14)

where \(\theta _{a,i} = {{\,\textrm{Tr}\,}}(\rho _a \sigma _i ) \) and \(c_{l,i} = {{\,\textrm{Tr}\,}}(O_l \sigma _i) \) and we define the vectors \(\varvec{\theta }_a = (\theta _{a,i})_{i=1}^{d^2} \in \mathbb {R}^{d^2}\) and \(\textbf{c}_l = (c_{l,i})_{i=1}^{d^2} \in \mathbb {R}^{d^2}\). Then, we note that for the QCB model, the rewards will be given by Eq. 13 with the restriction that, since we only receive one observable at each round, the context vector is constant among all actions. Thus, in our model, the rewards have the following expression:

$$\begin{aligned} R_t = \varvec{\theta }_a^\top \textbf{c}_{t} + \eta _t. \end{aligned}$$
(15)

We denote this classical model as linear disjoint single context bandits. In order to make clear when the classical real vectors parametrize an action \(\rho _a\in \gamma \) or a context \(O_l \in \mathcal {C}\) in Eq. 14, we will use the notation \(\varvec{\theta }_{\rho _a}\) and \(\textbf{c}_{O_l}\) with respect to the standard Pauli basis.
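The following minimal sketch illustrates the parametrization of Eq. 14 for a single qubit, using the Pauli basis normalized so that \({{\,\textrm{Tr}\,}}(\sigma _i\sigma _j)=\delta _{ij}\); the state and observable are toy examples, not taken from our simulations.

```python
# With an orthonormal Hermitian basis, the expected reward Tr(rho O) equals the
# Euclidean inner product theta^T c of the coefficient vectors (Eqs. 14-15).
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
basis = [P / np.sqrt(2) for P in (I2, X, Y, Z)]   # orthonormal w.r.t. Hilbert-Schmidt

def to_vector(M):
    """Real coefficient vector of a Hermitian matrix in the orthonormal Pauli basis."""
    return np.array([np.real(np.trace(M @ s)) for s in basis])

rho = np.array([[0.7, 0.2], [0.2, 0.3]], dtype=complex)   # a toy state
O = Z + 0.5 * X                                           # a toy observable (context)

theta, c = to_vector(rho), to_vector(O)
print(np.isclose(theta @ c, np.real(np.trace(rho @ O))))  # True
```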

4.2 Linear upper confidence bound algorithm

Now, we discuss the main strategy for the linear disjoint single context model (15) that is the LinUCB (linear upper confidence bound) algorithm (Auer 2003; Chu et al. 2011; Varsha et al. 2008; Rusmevichientong and Tsitsiklis 2010; Abbasi-Yadkori et al. 2011). We describe the procedure of LinUCB for selecting an action, and we leave for the next section a complete description of the algorithm for the QCB setting.

At each time step t, given the previous rewards \(R_1,...,\) \(R_{t-1} \in \mathbb {R}\), selected actions \(a_1,...,a_{t-1} \in [k]\), and observed contexts \(\textbf{c}_1,..., \textbf{c}_t \in \mathbb {R}^d\), the LinUCB algorithm builds the regularized least squares estimator for each unknown parameter \(\varvec{\theta }_a\), which has the following expression:

$$\begin{aligned} \tilde{\varvec{\theta }}_{t,a} = V_{t,a}^{-1} \sum _{s=1}^{t-1} R_s {\textbf {c}}_{s} \mathbb {I} \lbrace a_s = a \rbrace , \end{aligned}$$
(16)

where \(V_{t,a} = I + \sum _{s=1}^{t-1} {\textbf {c}}_{s} {\textbf {c}}^\top _{s} \mathbb {I} \lbrace a_s = a \rbrace \). Then, LinUCB selects the action at round t according to

$$\begin{aligned} a_{t} = {\arg \!\max _{a\in [k]}} \tilde{\varvec{\theta }}^\top _{t,a} \textbf{c}_{t} + \alpha \sqrt{ \textbf{c}^\top _{t} {V^{-1}_{t,a}} \textbf{c}_t } , \end{aligned}$$
(17)

where \(\alpha > 0\) is a constant that controls the width of the confidence region in the direction of \({\textbf {c}}_{t}\). The idea behind this selection is to use an overestimate of the unknown expected value through an upper confidence bound. This is the principle behind the UCB algorithm (Lai 1987), which gives rise to this class of optimistic strategies. The value of the constant \(\alpha \) is chosen depending on the structure of the action set. In the next section, we will discuss the appropriate choice of \(\alpha \) for our setting. We note that this strategy can also be applied in an adversarial setting where the context is chosen by an adversary instead of being sampled from some probability distribution.
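The following sketch (with hypothetical variable names and a fixed \(\alpha \)) shows how the estimator of Eq. 16 and the selection rule of Eq. 17 can be computed with one design matrix per action; it is an illustration rather than the implementation used for our simulations, and it rebuilds the statistics from the full history instead of updating them incrementally.

```python
# LinUCB action selection: per-action design matrix V_a and data vector b_a,
# regularized least squares estimate, then optimistic (UCB) score.
import numpy as np

def linucb_choose(contexts_seen, rewards, actions, c_t, k, dim, alpha=1.0):
    """Return the LinUCB action for context c_t given the history."""
    V = [np.eye(dim) for _ in range(k)]          # V_{t,a} = I + sum_s c_s c_s^T 1{a_s = a}
    b = [np.zeros(dim) for _ in range(k)]        # sum_s R_s c_s 1{a_s = a}
    for c_s, r_s, a_s in zip(contexts_seen, rewards, actions):
        V[a_s] += np.outer(c_s, c_s)
        b[a_s] += r_s * c_s
    ucb = np.empty(k)
    for a in range(k):
        V_inv = np.linalg.inv(V[a])
        theta_hat = V_inv @ b[a]                 # regularized least squares estimate
        ucb[a] = theta_hat @ c_t + alpha * np.sqrt(c_t @ V_inv @ c_t)
    return int(np.argmax(ucb))

# toy usage with a 3-dimensional parametrization and 2 actions
rng = np.random.default_rng(1)
hist_c = [rng.normal(size=3) for _ in range(10)]
hist_r = [rng.normal() for _ in range(10)]
hist_a = [int(rng.integers(2)) for _ in range(10)]
print(linucb_choose(hist_c, hist_r, hist_a, rng.normal(size=3), k=2, dim=3))
```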

The above procedure is shown to be sufficient for practical applications (Li et al. 2010), but the algorithms achieving the optimal regret bound are SupLinRel (Auer 2003) and BaseLinUCB (Chu et al. 2011). They use a phase elimination technique that consists of playing, at each round, only actions that are highly rewarding; still, the main subroutine for selecting the actions is LinUCB. This technique is not the most practical for applications, but it was introduced in order to derive rigorous regret upper bounds. If we apply these strategies to a d-dimensional QCB \((\gamma ,\mathcal {C})\), they achieve the almost optimal regret bound of

$$\begin{aligned} \text {Regret}_T^{\gamma , \mathcal {C},\pi } = O \left( d\sqrt{kT\ln ^3(T^2\log (T))} \right) . \end{aligned}$$
(18)

The above bound comes from Chu et al. (2011), and it is adapted to our setting using the vector parametrization (14). Our model uses a different unknown parameter \(\varvec{\theta }_a \in \mathbb {R}^d\) for each action \(a\in [k]\). This model can be easily adapted to settings that assume a single unknown parameter shared by all actions if we enlarge the vector space and define \(\varvec{\theta } = (\varvec{\theta }_1,..., \varvec{\theta }_k ) \in \mathbb {R}^{dk}\) as the unknown parameter. Their regret analysis works under the normalization assumptions \(\Vert \varvec{\theta } \Vert _2 \le 1\), \(\Vert \textbf{c}_t \Vert _2 \le 1\) and the choice of \(\alpha = \sqrt{\frac{1}{2} \ln (2T^2 k ) } \). We note that it matches our lower bound (10) except for the logarithmic terms. While we deal with arbitrarily large qubit systems, because of the choice of the families of Hamiltonians, the dimension the algorithm works in (which we define as \(d_\text {eff}\)) is much smaller than the dimension of the entire Hilbert space of these multi-qubit systems. This is discussed in detail in the next section. This effective dimension is small enough that the previous argument regarding the upper and lower bounds still holds. It is important to note that the effective dimension is small for these chosen families of Hamiltonians, but not for general sets of Hamiltonians.

5 Low energy quantum state recommender system

In this section, we describe how the QCB framework can be adapted for a recommender system for low-energy quantum states. We consider a setting where the learner is given optimization problems in an online fashion, is able to encode these problems into Hamiltonians, and also has access to a set of unknown preparations of (mixed) quantum states that they want to use in order to solve these optimization problems. The task is broken into several rounds; at every round, they receive an optimization problem and are required to choose the state that they will use for that problem. As a recommendation rule, we use the state with the lowest energy with respect to the Hamiltonian in which the optimization problem is encoded. We denote this problem as the low energy quantum state recommendation problem. We note that our model focuses on the recommendation following the mentioned rule. After selecting the state, the learner will use it for the optimization problem (for example, as the initial ansatz state of a variational quantum eigensolver), but that is a separate task. When a learner chooses an action, they must perform an energy measurement using the given Hamiltonian on the state corresponding to the chosen action. Then, the measurement outcomes are used to model rewards, and their objective is to maximize the expected cumulative reward, i.e., the expectation of the sum of the measurement outcomes over all the rounds played. Any Hamiltonian can be written as a linear combination of Pauli observables, and we use this property to perform the energy measurement: by measuring each of these Pauli observables (such measurements are experimentally feasible, Peruzzo et al. (2014)) and taking the appropriately weighted sum of the measurement outcomes, we can simulate such a measurement. The QCB framework naturally lends itself to this model, where the Hamiltonians are the contexts, and the set of states that can be prepared reliably serves as the actions.

In this paper, we study some important families of Hamiltonians—specifically, the Ising and a generalized cluster model from Verresen et al. (2017), which are linear combinations of Pauli observables with nearest-neighbor interactions, and for n qubits can be written as

$$\begin{aligned}&H_\text {ising}(h)=\sum _{i=1}^{n}(Z_iZ_{i+1}+hX_i),\end{aligned}$$
(19)
$$\begin{aligned}&H_\text {cluster}(j_1,j_2)=\sum _{i=1}^{n} (Z_i-j_1X_iX_{i+1}-j_2X_{i-1}Z_iX_{i+1}), \end{aligned}$$
(20)

where \(h,j_1,j_2 \in \mathbb {R}\). In the Ising model, h corresponds to the external magnetic field. Specifically, we consider QCBs with the following context sets

$$\begin{aligned}&\mathcal {C}_{\text {Ising}} = \left\{ H_\text {ising}(h) : h\in \mathbb {R} \right\} , \nonumber \\&\mathcal {C}_{\text {cluster}} = \left\{ H_\text {cluster}(j_1,j_2) : j_1,j_2\in \mathbb {R} \right\} . \end{aligned}$$
(21)

Important families of Hamiltonians, like the models discussed above, are translation-invariant and are spanned by Pauli observables with nearest-neighbor interactions; as a result, they span a low-dimensional subspace. We illustrate the scheme described above through the example of the Ising model contexts. The Pauli observables that need to be measured are \(\lbrace X_i\rbrace _{ i \in [n]}\) and \(\lbrace Z_iZ_{i+1} \rbrace _{i \in [n]}\). These observables have two possible measurement outcomes, \(-1\) and \(+1\), and by the reward distribution of a Pauli observable M given by Born’s rule (2) on a quantum state \(\rho \), the reward can be modeled as

$$\begin{aligned} R_{M,\rho }= 2\text {Bern}\left( \frac{{{\,\textrm{Tr}\,}}(M\rho )+1}{2}\right) -1 , \end{aligned}$$
(22)

where \(\text {Bern}(x) \in \lbrace 0, 1 \rbrace \) is a random variable with Bernoulli distribution with mean \(x\in [0,1]\). By performing such a measurement for all the Pauli observables and adding the rewards, the reward for \(\mathcal {C}_\text {Ising}\) is

$$\begin{aligned} R_{\text {Ising}}=-h\sum _{M \in \lbrace X_i : i \in [n]\rbrace }R_{M,\rho } -\sum _{M' \in \lbrace Z_iZ_{i+1} : i \in [n]\rbrace } R_{M',\rho }, \end{aligned}$$
(23)

where we took the negative of the sum of the measurements because we are interested in a recommender system for the lowest energy state. A similar formulation applies to the QCB with generalized cluster Hamiltonian contexts.
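The following sketch illustrates, for a small chain with periodic boundaries and dense matrices, how the reward of Eqs. 22 and 23 can be simulated term by term; the chain length and the maximally mixed test state are arbitrary choices made only for this example.

```python
# Simulate the Ising reward: each Pauli term of H_ising(h) is measured as a
# +/-1 Bernoulli outcome (Eq. 22) and combined with weights -h and -1 (Eq. 23).
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def local_op(ops, n):
    """Tensor product placing 2x2 operators from dict {site: op} on an n-qubit chain."""
    out = np.array([[1.0 + 0j]])
    for site in range(n):
        out = np.kron(out, ops.get(site, I2))
    return out

def pauli_reward(M, rho):
    p = (np.real(np.trace(M @ rho)) + 1) / 2           # Born probability of outcome +1
    return 2 * rng.binomial(1, min(max(p, 0.0), 1.0)) - 1

def ising_reward(rho, h, n):
    r = 0.0
    for i in range(n):
        r += -h * pauli_reward(local_op({i: X}, n), rho)
        r += -pauli_reward(local_op({i: Z, (i + 1) % n: Z}, n), rho)
    return r

n = 4
rho = np.eye(2**n) / 2**n                              # maximally mixed toy action
print(ising_reward(rho, h=0.5, n=n))
```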

In the rest of this section, we illustrate a modified LinUCB algorithm for the QCB setting. Then, we implement this recommender system, where the contexts are Hamiltonians belonging to the Ising and generalized cluster models (21), and demonstrate our numerical analysis of the performance of the algorithm by studying the expected regret. We also demonstrate that depending on the action set, the algorithm is able to approximately identify the phases of the context Hamiltonians.

5.1 Gram-Schmidt method

Similarly to the tasks of shadow tomography (Aaronson 2018) and classical shadows (Huang et al. 2020), we do not need to reconstruct the full quantum states since the algorithm only has to predict the trace between the contexts and the unknown quantum states. Thus, the LinUCB algorithm only has to store the part of the estimators relevant for this computation. As the measurement statistics depend only on the coefficients corresponding to the Pauli observables spanning the observables in the context set, only those Pauli observables in the expansion of the estimators are relevant. This means that our algorithm can operate in a space with a smaller dimension than the entire space spanned by n qubits, which has a dimension that is exponential in the number of qubits. As a side note, we would like to remark that one could hope to apply classical shadows to our problem since both tasks seem very similar. However, this is not the case since we need an online method that performs the best recommendation at each time step t.

In order to exploit this property to improve the space complexity of the LinUCB algorithm, we use the Gram-Schmidt procedure in the following way. At any round, a basis for the vector parameterizations (as shown in Eq. 14) of all the previously received contexts is stored. If the incoming vector parameterization of the context is not spanned by this basis, the component of the vector orthogonal to the space spanned by this set is found by a Gram-Schmidt orthonormalization-like process, and this component is added to the set after normalization. Therefore, at any round, there will be a list of orthonormal vectors that span the subspace of all the vector parameterizations of the contexts received so far, and the size of the list will be equal to the dimension of the subspace, which we call effective dimension, i.e.,

$$\begin{aligned} d_{\text {eff},t}=\dim \left( \text {span}\lbrace O_{c_s} \in \mathcal {C} : s \in [t]\rbrace \right) . \end{aligned}$$
(24)

From now on, we will omit the subscript for the time step t and simply denote the effective dimension as \(d_\text {eff}\). Instead of feeding the context vectors directly, for any incoming context vector, we construct \(d_\text {eff}\)-dimensional vectors, whose \(i^{th}\) term is the inner product of the context vector and the \(i^{th}\) basis vector. In case the incoming vector is not spanned by the basis, we first update the list by a Gram-Schmidt procedure (which will result in an addition of another orthonormal vector to the list and an increase in \(d_\text {eff}\) by 1) and then construct a \(d_\text {eff}\)-dimensional vector as described before. This vector is fed to the LinUCB algorithm. The Gram-Schmidt procedure is stated in Algorithm 1, and the modified LinUCB algorithm is stated explicitly in Algorithm 2. The efficiency of this method is well illustrated in the case where all the contexts are local Hamiltonians. As an example, we discuss the case of generalized cluster Hamiltonians. Note that the space complexity of the standard QCB framework is \(O(kd^2)\), where k is the number of actions, and d is the dimension of the vector parameterizations of the contexts. In the standard LinUCB technique, the context vectors \(\mathbf {c_t},t \in [T]\) would be \(4^n\)-dimensional, where n is the number of qubits the Hamiltonian acts on, in which case the space complexity of the algorithm is \(O(k4^{2n})\). In our studies, the contexts are Ising Hamiltonians and a generalized cluster Hamiltonian (21) with \(d_\text {eff}\le 2\) and \(d_\text {eff}\le 3\), respectively. Since the vectors fed into the modified LinUCB are \(d_\text {eff}\)-dimensional, the space complexity is \(O(kd_\text {eff}^2)\), i.e., O(4k) and O(9k), respectively.

Algorithm 1

Gram-Schmidt Algorithm (\(\text {Gram}(\textbf{c}, {V_{a}},\textbf{b}_a,\text {CBasis})\)).
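A possible realization of the Gram-Schmidt step is sketched below; the exact bookkeeping of Algorithm 1 may differ, and the padding of the per-action statistics \(V_a\) and \(\textbf{b}_a\) when \(d_\text {eff}\) grows follows the textual description above.

```python
# Gram-Schmidt step: project a new context vector onto the stored orthonormal
# basis, extending the basis (and zero-padding the per-action statistics) when
# the context is not already spanned.
import numpy as np

def gram_step(c, basis, V, b, tol=1e-10):
    """basis: list of orthonormal vectors; V, b: lists of per-action statistics."""
    residual = c.astype(float).copy()
    for v in basis:
        residual -= (v @ c) * v
    if np.linalg.norm(residual) > tol:                 # c adds a new direction
        basis.append(residual / np.linalg.norm(residual))
        V = [np.pad(Va, ((0, 1), (0, 1))) for Va in V] # pad with zeros ...
        for a in range(len(V)):
            V[a][-1, -1] = 1.0                         # ... keep V_a = I on the new axis
        b = [np.pad(ba, (0, 1)) for ba in b]
    coords = np.array([v @ c for v in basis])          # d_eff-dimensional representation
    return coords, basis, V, b
```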

Algorithm 2

LinUCB with Gram-Schmidt.
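The following self-contained sketch mirrors the structure of Algorithm 2 on a toy linear environment; the context distribution, the noise, and the constant \(\alpha \) are arbitrary assumptions, the Gram-Schmidt step from the previous sketch is inlined, and it is not the exact implementation used for the simulations reported below.

```python
# LinUCB with Gram-Schmidt on a toy linear environment: contexts live in a
# high-dimensional parametrization, but the algorithm only stores d_eff-dimensional
# statistics in the basis of the contexts seen so far.
import numpy as np

rng = np.random.default_rng(0)
D, k, T, alpha = 16, 3, 500, 2.0                   # full parametrization dim, actions, rounds
true_theta = [rng.normal(size=D) / np.sqrt(D) for _ in range(k)]   # unknown parameters
context_pool = [rng.normal(size=D) for _ in range(3)]              # d_eff will be <= 3

basis = []                                         # orthonormal basis of seen contexts
V = [np.zeros((0, 0)) for _ in range(k)]           # per-action design matrices (d_eff x d_eff)
b = [np.zeros(0) for _ in range(k)]
regret = 0.0

for t in range(T):
    c_full = context_pool[rng.integers(len(context_pool))]
    # Gram-Schmidt step: extend the basis if needed, then project to d_eff coordinates.
    res = c_full - sum((v @ c_full) * v for v in basis) if basis else c_full.copy()
    if np.linalg.norm(res) > 1e-10:
        basis.append(res / np.linalg.norm(res))
        V = [np.pad(Va, ((0, 1), (0, 1))) for Va in V]
        for a in range(k):
            V[a][-1, -1] = 1.0                     # keep the identity regularizer
        b = [np.pad(ba, (0, 1)) for ba in b]
    c = np.array([v @ c_full for v in basis])
    # LinUCB selection (Eqs. 16-17) in the d_eff-dimensional representation.
    ucb = []
    for a in range(k):
        V_inv = np.linalg.inv(V[a])
        theta_hat = V_inv @ b[a]
        ucb.append(theta_hat @ c + alpha * np.sqrt(c @ V_inv @ c))
    a_t = int(np.argmax(ucb))
    # Simulated reward: linear mean (Eq. 15) plus bounded noise.
    means = [th @ c_full for th in true_theta]
    r_t = means[a_t] + rng.uniform(-0.1, 0.1)
    V[a_t] += np.outer(c, c)
    b[a_t] += r_t * c
    regret += max(means) - means[a_t]

print(f"d_eff = {len(basis)}, cumulative expected regret ~ {regret:.2f}")
```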

5.2 Phase classifier

In order to implement the numerical simulations, we need to choose the environments for the QCB with context sets \(\mathcal {C}_\text {Ising}\) and \(\mathcal {C}_\text {cluster}\). Elements of both context sets are parameterized by tunable parameters. We study the performance of the recommender system by choosing a context probability distribution that is uniform (over a chosen finite interval on the real line) in these parameters. Then, we choose the actions as ground states of Hamiltonians that correspond to the limiting cases (in terms of the parameters) of these models. In order to study the performance of our strategy, apart from the expected regret (6), we want to observe how the actions are chosen. For every action, we maintained a set, which contained all the Hamiltonians for which that action was chosen. We observed that almost all the elements in each of these sets belonged to the same phase of the Hamiltonian models.

In order to study the performance of the algorithm in this respect, we define the classifier regret as

$$\begin{aligned} \text {ClassifierRegret}_{T}^{\mathcal {\gamma ,C},\pi }=\sum _{t=0}^{T-1}\mathbb {I}\left[ a_t \ne a_{\text {optimal},\text {t}} \right] , \end{aligned}$$
(25)

where \(a_{\text {optimal},\text {t}}=\text {argmax}_{a \in [k]}\textrm{Tr}\left( O_{t}\rho _a\right) \), and \(O_t \in \mathcal {C}\) is the context observable received in the \(t^\text {th}\) round. Note that the above classifier regret is not guaranteed to be sublinear, as the expected regret in Eq. 18 is for the LinUCB strategy. This can be understood intuitively: consider a scenario where the bandit picks an action with a small sub-optimality gap (7); then, the expected regret will increase by a very small amount, while the classifier regret will increase by one unit, as all misclassifications contribute equally to it. These, however, are theoretical worst-case scenarios, and the classifier regret is useful to study the performance of the algorithm in practice in our settings.
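The following toy example (with made-up numbers) illustrates the difference between the two figures of merit: two rounds are misclassified, yet the expected regret they contribute is small because the corresponding sub-optimality gaps are small.

```python
# Classifier regret (Eq. 25) counts every misclassification equally, while the
# expected regret (Eq. 6) weights each mistake by its sub-optimality gap.
import numpy as np

expected_rewards = np.array([[1.0, 0.99, 0.2],    # Tr(O_t rho_a) for each round t (rows)
                             [0.5, 0.8, 0.79],
                             [0.3, 0.3, 0.6]])
picked = np.array([1, 2, 2])                       # actions chosen by the bandit
optimal = expected_rewards.argmax(axis=1)
classifier_regret = int(np.sum(picked != optimal))
expected_regret = float(np.sum(expected_rewards.max(axis=1)
                               - expected_rewards[np.arange(3), picked]))
print(classifier_regret, expected_regret)          # 2 misclassifications, expected regret 0.02
```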

Fig. 2

Plots of the regret and classifier regret for the QCB \((\gamma ,\mathcal {C})\), where the Hamiltonians in \(\mathcal {C}\) are a specific form of generalized cluster models acting on 10 and 100 qubits, respectively. The performance is not very different since \(d_\text {eff}=3\) (24) in both cases. The action set is chosen to be approximate ground states of some generalized cluster Hamiltonians

5.3 Numerical simulations

Before we move into the specific cases, we note the importance of the choice of \(\alpha \) in Algorithm 2. While the theoretical analysis of the LinUCB algorithm depends on the choice of \(\alpha \), in practice, one can tune this value to observe a better performance. We primarily use the \(\alpha \) described in Lattimore and Szepesvári (2020) (Chapter 19) given by

$$\begin{aligned} \alpha _t= m + \sqrt{2\log \left( \frac{1}{\delta } \right) +d\log \left( 1+\frac{tL^2}{d} \right) }. \end{aligned}$$
(26)

Here, L and m are upper bounds on the 2-norm of the context vectors and of the unknown parameter, respectively, d is the dimension, and \(\delta \) is a failure probability.
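For completeness, Eq. 26 can be transcribed directly as in the following sketch; the default values of \(\delta \), L, and m are placeholders rather than the values used in our simulations.

```python
# Direct transcription of the alpha schedule in Eq. 26.
import numpy as np

def alpha_t(t, d, delta=0.01, L=1.0, m=1.0):
    return m + np.sqrt(2 * np.log(1 / delta) + d * np.log(1 + t * L**2 / d))

print(alpha_t(t=100, d=3))
```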

Finally, while we study the performance of our algorithm in our simulations with estimates of the expected regret and expected classifier regret, it is important to note that in an experimental setup, the learner will only be able to measure the cumulative reward at every round. However, since these are simulations, we are able to study the regret as well, as these are standard metrics to gauge the performance of the algorithms. In the next subsection, we discuss our simulations of the QCB \((\gamma ,\mathcal {C}_\text {cluster})\) model; the QCB \((\gamma ,\mathcal {C}_\text {Ising})\) is discussed in Appendix 7.

5.3.1 Generalized cluster model

We study the performance of the recommender system for the QCB \((\gamma ,\mathcal {C}_\text {cluster})\), where the generalized cluster Hamiltonians (Verresen et al. 2017) act on 10 qubits and 100 qubits, respectively. We observe that the performance of the algorithm is not affected by the number of qubits, as the effective dimension of the context set remains unchanged, i.e., \(d_\text {eff}=3\). We study the expected regret and classifier regret for these two cases and illustrate the system’s performance in finding the phases of the generalized cluster Hamiltonians. This model was also studied in Caro et al. (2022), where a quantum convolutional neural network was designed to classify quantum states across phase transitions. We chose 5 actions corresponding to approximate ground states of Hamiltonians that are the limiting cases of the generalized cluster model, i.e., generalized cluster Hamiltonians with parameters \((j_1,j_2)\) in Eq. 20 tending to \((0,0)\), \((0,\infty )\), \((\infty ,0)\), \((0,-\infty )\), and \((-\infty ,0)\). Note that these methods of approximating ground states are only for simulation purposes.
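For simulation purposes only, approximate ground states of the limiting cases can be obtained as in the following sketch, which diagonalizes a dense generalized cluster Hamiltonian on a small ring with large finite couplings standing in for the infinite limits; the chain length, the boundary conditions, and the coupling value are arbitrary choices.

```python
# Build toy actions: ground states of H_cluster(j1, j2) (Eq. 20) at its limiting cases.
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def local_op(ops, n):
    out = np.array([[1.0 + 0j]])
    for site in range(n):
        out = np.kron(out, ops.get(site, I2))
    return out

def h_cluster(j1, j2, n):
    """Dense H_cluster(j1, j2) on an n-qubit ring (periodic boundaries assumed)."""
    H = np.zeros((2**n, 2**n), dtype=complex)
    for i in range(n):
        H += local_op({i: Z}, n)
        H -= j1 * local_op({i: X, (i + 1) % n: X}, n)
        H -= j2 * local_op({(i - 1) % n: X, i: Z, (i + 1) % n: X}, n)
    return H

def ground_state(H):
    _, vecs = np.linalg.eigh(H)
    psi = vecs[:, 0]
    return np.outer(psi, psi.conj())               # density matrix of the ground state

n, big = 6, 50.0                                    # "big" stands in for the infinite limits
limits = [(0, 0), (0, big), (big, 0), (0, -big), (-big, 0)]
actions = [ground_state(h_cluster(j1, j2, n)) for j1, j2 in limits]
print(len(actions), actions[0].shape)               # 5 actions, each a 2^6 x 2^6 state
```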

Fig. 3

These plots illustrate how the recommender system identifies the phases of the generalized cluster Hamiltonian. The x and y-axis represent the coupling coefficients of the generalized cluster Hamiltonian received as context. Like the Ising model simulations, we associate a color to each action. For any context, \(H_\text {cluster}(j_1,j_2)\) corresponding to any of the T rounds, one of these actions is picked by the algorithm. We plot the corresponding colored dot (blue for the ground state of \(H_\text {cluster}(-\infty ,0)\), orange for \(H_\text {cluster}(0,\infty )\), red for \(H_\text {cluster}(\infty ,0)\), green for \(H_\text {cluster}(0,-\infty )\), and purple for \(H_\text {cluster}(0,0)\)) at the appropriate coordinates for rounds that follow after the bandit has “learned” the actions, i.e., the growth in regret has slowed down. The lines in black represent the actual phase diagram

Initially, a steep growth in regret is observed, followed by a sudden slowdown. On looking closely at the plots in Fig. 2, we find that the regret indeed continues to grow, albeit at a slower pace. This is explained by observing that the sub-optimality gap of the second-best action is quite small in comparison to the sub-optimality gaps of the rest of the actions. Initially, the LinUCB algorithm does not have enough information about the unknown parameters and has to play all actions, resulting in an exploration phase. However, at some point, the bandit recognizes the “bad” actions and plays either the best action or the action with a small sub-optimality gap most of the time—this is when the bandit has begun to balance exploration and exploitation. This is illustrated by observing the growth of the regret before and after the first 50 rounds in the insets of Fig. 2. As mentioned in Sect. 5.2, the recommendation system picks the same action for context Hamiltonians belonging to the same phase. We illustrate this in Fig. 3. In the scatter plot, when a context generalized cluster Hamiltonian is received, a dot is plotted with the x-axis and y-axis coordinates corresponding to its parameters \(j_1,j_2\), respectively. Depending on the action picked by the algorithm, we associate a color with the dot. The resultant plot is similar (but not identical) to the phase diagram of the generalized cluster Hamiltonian. We stress that the algorithm used for these plots is the same as the one for Fig. 2; we are just plotting a visual representation in terms of the phases. We do not expect this method to compete with algorithms specifically designed for phase classification, such as the one in Huang et al. (2021).

6 Outlook

This work describes the first steps toward recommending quantum data by implementing the bandit framework in a rigorous fashion for practical scenarios. We provide a recommender system based on the theory of linear contextual bandits and show that the upper and lower bounds on its performance match asymptotically up to logarithmic factors. We also demonstrate its efficiency in practice through simulations. Finally, we show how such a system could also be used to recognize phases of Hamiltonians.

We restricted our attention to a model where the expected rewards follow a linear function in terms of the context and the unknown states. While the low energy quantum state recommendation problem uses the outcome of the measurement as a reward, one could think of other recommendation tasks with more complicated reward functions. Non-linear rewards have been studied in the bandit literature under the name of structured bandits (Lattimore and Munos 2014; Combes et al. 2017; Russo and Roy 2013). This model could be a natural extension of the QCB for other recommendation tasks where the rewards are not in one-to-one correspondence with measurement outcomes. Going back to the general model, the environment is modeled by a set of unknown quantum processes, which in the QCB model we assumed to be a set of stationary unknown quantum states. In a more general scenario, we can consider environments that change with time due to some Hamiltonian evolution or noise interaction with an external environment. In the bandit literature, non-stationary environments were first considered in Gittins (1979); Gittins et al. (2011), where each action is associated with a Markov chain, and in the restless bandit model (Whittle 1988), where the Markov chain associated with each action evolves with time. More recently, Luo et al. (2018) studied a contextual bandit model with non-stationary environments. We expect that recommender systems for quantum data can also be extended to similar settings.