1 Introduction

Inhomogeneous Poisson process data defined on a continuous spatio-temporal domain has attracted immense attention recently in a wide variety of applications, including reliability analysis in manufacturing systems (Soleimani et al., 2017), event capture in sensing regions (Mutny and Krause, 2021), crime prediction in urban area (Shirota and Gelfand, 2017) and disease diagnosis based on medical records (Lasko, 2014). The reliable training of an inhomogeneous Poisson process model critically relies on a large amount of data to avoid overfitting, especially when modeling high-dimensional point processes. However, one challenge is that the available training data is routinely sparse or even partially missing in specific applications. Taking manufacturing failure and healthcare analysis as motivating examples: the modern manufacturing machines are reliable and sparsely fail; the individuals with healthy constitution will not visit hospital very often. The data missing problems also arise, e.g., the event location capture is intermittent for sensing systems because of weather or other related barriers. To handle data sparse/missing problems, the correlation between multiple tasks can be exploited to facilitate sharing of information between all tasks to improve the generalization capabilities, forming a multi-task learning paradigm.

A popular approach to modeling multi-task inhomogeneous Poisson processes is to use Gaussian process (GP) (Williams & Rasmussen, 2006) based Bayesian framework to induce correlation among tasks. This kind of multi-task inhomogeneous Poisson processes are also called multi-task Cox processes (Møller et al., 1998). Multi-task Cox processes have been investigated extensively in recent years, e.g., hierarchical-GP based version (Lian et al., 2015) and multi-output Gaussian processes (MOGP) based versions (Aglietti et al., 2019; Jahani et al., 2021). Yet to our knowledge, all the aforementioned works focus on homogeneous multi-task Cox processes learning, i.e., all correlated tasks are exclusively point process tasks. It is not free to apply them to the more general heterogeneous multi-task scenarios where correlated tasks include other types of tasks except Cox processes. Take the urban data of Vancouver in Fig. 3 as a motivating example where we have three types of tasks: employment income (regression), education degree (classification), theft of vehicle (Cox process) and non-market house (Cox process). When the crime data is missing in certain areas of the city, training on this single task is prone to overfitting since the model may try to fit the available data too closely, leading to inaccurate predictions or poor generalization to unseen data. Can we leverage the information of employment income, education degree and non-market housing to assist the prediction of crime rate in the missing areas? Or, can we make use of income, education and crime to help predict the number of non-market housing projects in certain missing areas? Based on our knowledge, only a few heterogeneous frameworks exist, such as Moreno-Muñoz et al. (2018). However, Moreno-Muñoz et al. (2018) discretized the point process task into Poisson distribution problems and does not preserve conjugate operations. To make further progress, we generalize the homogeneous multi-task Cox processes to the heterogeneous setup using MOGP to enable the transfer of knowledge between supervised (regression and classification) and unsupervised tasks (Cox processes).

Most existing Cox process works focus on the log Gaussian Cox process (LGCP) (Møller et al., 1998) where a GP function is passed through an exponential link function to model the positive intensity rate. Due to the nonconjugacy between point process likelihood and GP prior, practitioners need to apply Markov chain Monte Carlo (MCMC) (Neal, 1993) or variational inference (Blei et al., 2017) methods to infer the posterior distribution of model parameters. For MCMC, the specialised MCMC algorithms, such as Metropolis-adjusted Langevin algorithm (MALA) (Møller et al., 1998; Besag, 1994), as well as the probabilistic programming languages based on MCMC (Wood et al., 2014) where one does not need to write a sampler by hand, can be used for sampling from the posterior of intensity function. Although MCMC provides the guarantee of asymptotic consistency, this accuracy comes at the expense of a high computational cost. On the contrary, variational inference can be faster than MCMC, although it induces approximation error. For the efficiency reason, we focus on variational inference in this work. For variational inference, a Gaussian variational posterior is typically assumed to render the evidence lower bound (ELBO) tractable (Dezfouli & Bonilla, 2015; Lloyd et al., 2015). While this variational inference method is quite generic, it can exhibit low efficiency (although it is still faster than MCMC) (Wenzel et al., 2019), exposing opportunities for improvement. It is worth noting that the same problem also occurs in GP classification tasks. This work remediates these issues by basing our model on sigmoidal Gaussian Cox process (SGCP) (Adams et al., 2009), using a scaled sigmoid function as link function in point process tasks, and the logistic regression model in classification tasks. The reason we choose sigmoid as link function in both types of tasks is we can exploit the data augmentation technique (Polson et al., 2013; Donner & Opper, 2018) to construct a mean-field approximation that has closed-form iterative updates. As shown later, the proposed mean-field approximation exhibits superior efficiency and fast convergence.

Specifically, we make the following contributions. (1) From a modeling perspective, we establish a MOGP based heterogeneous multi-task Gaussian Cox processes (HMGCP) model that provides an extension of the homogeneous version to account for multiple heterogeneous correlated tasks. (2) From an inference perspective, we adopt the data augmentation technique to derive an efficient mean-field approximation with analytical expressions. As far as we know, this work should be the first attempt to use data augmentation in the MOGP setting. (3) In experiments, we provide evidence of the benefits of modeling heterogeneous correlated tasks and the predominant efficiency and convergence of our inference method.

2 Related work

2.1 Multi-output Gaussian processes

Multi-output Gaussian processes (Álvarez et al., 2012) extend the single-output Gaussian process to model vector-valued functions, providing a powerful Bayesian tool for multi-task learning as it accounts for the correlation between multiple outputs. Bonilla et al. (2007) has shown that if multiple outputs are correlated, exploiting such correlation can provide insightful information about each output and better predictions in the case of sparse/missing data. More importantly, as a Bayesian nonparametric approach, it offers higher flexibility over parametric alternatives and a natural mechanism for uncertainty quantification. To define a MOGP, we need to define a suitable cross-covariance function that accounts for the correlation between multiple outputs, which leads to a valid covariance function for the joint GP (Álvarez et al., 2019). The two common ways to define cross-covariance functions are linear model of coregionalization (LMC) (Journel & Huijbregts, 1976) and process convolution (Ver Hoef & Barry, 1998). In this work, we focus on the LMC approach.

2.2 Multi-task Cox processes

Extensive works have been accumulated on the single-task Gaussian Cox process (Møller et al., 1998; Diggle et al., 2013). Recently, many works tried to extend the single-task Cox process to the multi-task setup to introduce correlation between tasks. For example, Lian et al. (2015) proposed a multi-task Cox process model that leverages information from all tasks via a hierarchical GP. In a different way, Aglietti et al. (2019) and Jahani et al. (2021) adopted the MOGP based on LMC and process convolution respectively to model the intensity functions of multiple Cox processes, which facilitates sharing of information and allows for flexible event occurrence rate. All these works exclusively focus on homogeneous multi-task Cox processes. On the contrary, we extend to the heterogeneous scenarios to enable transfer of knowledge between Cox process, regression and classification tasks.

2.3 Data augmentation

In GP regression, the conjugacy between likelihood and prior makes the posterior computing easy and closed-form. However, in GP classification and point process, such conjugacy no longer holds and one may resort to variational inference to approximate the true posterior. Most generic non-conjugate variational inference, assuming a Gaussian variational posterior to make the ELBO tractable, exhibits low efficiency due to computing of expectations (Dezfouli & Bonilla, 2015). Recently, another inference method based on data augmentationFootnote 1 has been established for GP classification (Polson et al., 2013; Wenzel et al., 2019) and point process (Donner & Opper, 2018; Zhou et al., 2020; Zhou et al., 2021; Zhou et al., 2022). The core idea is to augment likelihood by auxiliary latent variables to convert the non-conjugate problem to a conditionally conjugate one, thus making inference easy (Li et al., 2014). Here, such an idea is extended to the MOGP modulated multi-task framework.

3 Problem formulation

Traditionally, existing works have considered the homogeneous multi-task Cox processes learning where all tasks are Cox processes (Aglietti et al., 2019; Jahani et al., 2021). The homogeneous model is not applicable to the more general heterogeneous scenario which includes various types of tasks except Cox processes. In this work, we are interested in the more general heterogeneous scenario where correlated tasks are a mix of supervised (regression and classification) and unsupervised tasks (Cox processes). Let us consider a problem setting where we have data from I tasks, among which \(I_r\) tasks are regression problems with dataset \(\mathcal {D}_r=\{\{(\textbf{x}_{i,n}^r,y_{i,n}^r)\}_{n=1}^{N_i^r}\}_{i=1}^{I_r}\), \(I_c\) tasks are classification problems with dataset \(\mathcal {D}_c=\{\{(\textbf{x}_{i,n}^c,y_{i,n}^c)\}_{n=1}^{N_i^c}\}_{i=1}^{I_c}\) and \(I_p\) tasks are point process problems with dataset \(\mathcal {D}_p=\{\{(\textbf{x}_{i,n}^p)\}_{n=1}^{N_i^p}\}_{i=1}^{I_p}\). \(\textbf{x}\in \mathcal {X}\subset \mathbb {R}^D\) is the D-dimensional input; \(y\in \mathbb {R}\) is the output in regression tasks and \(\{-1,1\}\) in classification tasks.Footnote 2 Point process tasks are unsupervised learning problems so they only include \(\textbf{x}\). Throughout the paper, we use index rcp to indicate regression, classification and point process tasks, respectively.

3.1 Heterogeneous likelihood

In order to use GP to represent the likelihood parameters in three types of tasks, we need to design the appropriate transformation to map the GP output to the domain of specific parameters. For regression tasks, following tradition, we use Gaussian distribution as likelihood, where the mean is modeled as a GP function and the variance is treated as a hyperparameter. For binary classification tasks, we use Bernoulli distribution (Uspensky, 1937) as likelihood whose parameter is modeled by the sigmoid transformation of a GP function, mapping \(\mathbb {R}\rightarrow [0,1]\), which is also called logistic regression. For Cox process tasks, although many existing works focus on LGCP, our work adopts the SGCP instead, i.e., the intensity of i-th Cox process is assumed to be \(\lambda _i(\textbf{x})=\bar{\lambda }_i s(g_i(\textbf{x}))\) where a task-specific GP function \(g_{i}\) is passed through a sigmoid function \(s(\cdot )\) and then scaled by an upper-bound \(\bar{\lambda }_{i}\). The reason we choose the sigmoid link function in both classification and point process tasks is that we can exploit the data augmentation to make inference easy and fast. Specifically, three types of likelihoods are:

$$\begin{aligned}&p(\textbf{y}^{r}\mid \{g_{i}^r\}_{i=1}^{I_r})=\prod _{i=1}^{I_r}\prod _{n=1}^{N_{i}^r}\mathcal {N}(y^r_{i,n}\mid g^r_{i,n},\sigma _i^2), \end{aligned}$$
(1a)
$$\begin{aligned}&p(\textbf{y}^{c}\mid \{g_{i}^c\}_{i=1}^{I_c})=\prod _{i=1}^{I_c}\prod _{n=1}^{N^c_{i}}s(y^c_{i,n}g^c_{i,n}),\end{aligned}$$
(1b)
$$\begin{aligned}&p(\textbf{x}^p\mid \{\bar{\lambda }_i,g^p_i\}_{i=1}^{I_p})=\prod _{i=1}^{I_p}\prod _{n=1}^{N^p_i}\bar{\lambda }_i s(g^p_{i,n})\exp \left( -\int _{\mathcal {X}}\bar{\lambda }_i s(g^p_i(\textbf{x}))d\textbf{x}\right) , \end{aligned}$$
(1c)

where \(g_{i}\) is the task-specific GP function and we call it latent function (Rasmussen, 2003) afterwards; \(g_{i}^r\), \(g_{i}^c\), \(g^p_i\) are the corresponding i-th output of the regression, classification and point process tasks, respectively; \(g^{\cdot }_{i,n}\) indicates \(g^{\cdot }_i(\textbf{x}^{\cdot }_{i,n})\). Equation (1a) is the likelihood for regression; Eq. (1b) is the likelihood for binary classification; Eq. (1c) is the likelihood for point process (Daley and Vere-Jones, 2003).

3.2 MOGP prior

Instead of modeling each \(g_i\) independently, we apply the MOGP prior on g’s to introduce correlation between multiple tasks in order to improve the generalization capability of our model especially when data is sparse or missing. In this work, we use the LMC (Journel & Huijbregts, 1976) approach to define the cross-covariance function. Specifically, we assume each latent function \(g_i\) is a linear combination of Q basis functions which are drawn from Q independent zero-mean GP prior, i.e., \(\{f_q\sim \mathcal{G}\mathcal{P}(0,k_q)\}_{q=1}^Q\) where \(k_q\) is a covariance function. Each latent function can be written as \(g_i=\sum _{q=1}^Qw_{i,q}f_q\) where \(w_{i,q}\in \mathbb {R}\) is the mixing weight capturing the contribution of q-th basis function to i-th latent function. It is easy to see that the mean of \(g_i\) is zero and the cross-covariance \(k_{g_i,g_j}(\textbf{x},\textbf{x}')=\text {cov}[g_i(\textbf{x}),g_j(\textbf{x}')]=\sum _{q=1}^Qw_{i,q}w_{j,q}k_q(\textbf{x},\textbf{x}')\). If we define \(\textbf{g}_i\) to be the vector of latent function values on the inputs of i-th task, we have the following MOGP prior: \(\textbf{g}\sim \mathcal {N}(\textbf{0},\textbf{K})\), where \(\textbf{g}=[\textbf{g}_{1}^\top ,\ldots ,\textbf{g}_{I}^\top ]^\top\), \(I=I_r+I_c+I_p\), \(\textbf{K}\) is a block-wise matrix with blocks given by \(\{\textbf{K}_{\textbf{g}_{i},\textbf{g}_{j}}\}_{i=1,j=1}^{I,I}\) whose entries are \(k_{g_i,g_j}(\textbf{x},\textbf{x}')\). \(\textbf{x}\) and \(\textbf{x}'\) are the inputs of i-th and j-th tasks, respectively. It is worth noting that each task can have a different set of inputs, but when all tasks have the same set of inputs, e.g., \(\textbf{X}\), the computing of \(\textbf{K}\) can be simplified as the sum of Kronecker products \(\textbf{K}=\sum _{q=1}^Q\textbf{w}_q\textbf{w}_q^\top \otimes \textbf{K}_q\) where \(\textbf{w}_q=[w_{1,q},\ldots , w_{I,q}]^\top\), \(\textbf{K}_q\) is the square matrix of \(k_q(\textbf{x},\textbf{x}')\) with \(\textbf{x},\textbf{x}'\in \textbf{X}\) (Moreno-Muñoz et al., 2018). This property cooperates well with the inducing inputs formalism which is discussed later.

4 Inference

According to Bayes’ theorem, the posterior of latent functions and intensity upper-bounds can be computed as:

$$\begin{aligned} \begin{aligned}&p(g,\bar{\varvec{\lambda }}\mid \textbf{y}^{r},\textbf{y}^{c},\textbf{x}^p)\propto \\&\underbrace{p(\textbf{y}^{r}\mid \{g_{i}^r\}_{i=1}^{I_r})}_{\text {regression}}\underbrace{p(\textbf{y}^{c}\mid \{g_{i}^c\}_{i=1}^{I_c})}_{\text {classification}}\underbrace{p(\textbf{x}^p\mid \{\bar{\lambda }_i,g^p_i\}_{i=1}^{I_p})}_{\text {Cox process}}\underbrace{p(g)}_{\text {MOGP}}p(\bar{\varvec{\lambda }}), \end{aligned} \end{aligned}$$

where \(g=[g_1,\ldots ,g_I]^\top\), \(\bar{\varvec{\lambda }}=[\bar{\lambda }_1,\ldots ,\bar{\lambda }_{I_p}]^\top\), p(g) is the infinite-dimensional version of MOGP, \(p(\bar{\varvec{\lambda }})\propto \prod _{i=1}^{I_p}\frac{1}{\bar{\lambda }_i}\) is the improper prior. The likelihood of regression is conjugate to the prior. However, such conjugacy is no longer valid for classification and Cox process tasks, so the posterior has no closed-form solution.

To address the non-conjugate issue for classification or Cox process, many works applied the variational inference that assumed a Gaussian variational distribution to render the ELBO tractable (Dezfouli & Bonilla, 2015; Hensman et al., 2015; Aglietti et al., 2019; Jahani et al., 2021). However, such generic variational inference exhibits low efficiency due to computing of expectations in ELBO (Wenzel et al., 2019). In this work, borrowing the idea of data augmentation, we augment Pólya-Gamma latent variables (Polson et al., 2013) and marked Poisson latent processes (Donner & Opper, 2018) into the likelihood of classification and Cox process. Finally, the augmented likelihood is conditionally conjugate to the MOGP prior. Based on the augmented model, we derive a mean-field approximation with closed-form iterative updates to provide an approximate posterior. The proofs of all relevant formulas below are provided in the appendix.

4.1 Augmentation for classification tasks

Polson et al. (2013) proposed a novel Pólya-Gamma augmentation strategy for Bayesian logistic regression. The core idea is that the binomial likelihood parametrized by log odds can be represented as a mixture of Gaussians w.r.t. a Pólya-Gamma distribution.

If \(\omega \sim p_{\text {PG}}(\omega \mid b,0)\) denotes the Pólya-Gamma random variable with \(\omega \in \mathbb {R}^+\) and \(b>0\), the following integral identity holds for \(a\in \mathbb {R}\):

$$\begin{aligned} \frac{(e^{z})^a}{(1+e^{z})^b}=2^{-b}e^{(a-b/2)z}\int _0^\infty e^{-z^2\omega /2}p_{\text {PG}}(\omega \mid b,0)d\omega . \end{aligned}$$

In this work, we do not need to know the exact form of the Pólya-Gamma distribution, but only its first moment. Setting \(a=b=1\) yields the factorization of sigmoid function:

$$\begin{aligned} s(z)=\frac{e^z}{1+e^z}=\int _0^{\infty } e^{h(\omega ,z)}p_{\text {PG}}(\omega \mid 1,0)d\omega , \end{aligned}$$
(2)

where \(h(\omega ,z)=z/2-z^2\omega /2-\log 2\). Substituting Eq. (2) into the classification likelihood in Eq. (1b), we obtain the augmented classification likelihood which has the elegant conditionally conjugate property. After augmenting Pólya-Gamma random variables, the logistic regression likelihood in Eq. (1b) is augmented to be:

$$\begin{aligned} \begin{aligned}&p(\textbf{y}^c,\varvec{\omega }^c\mid \{g^c_i\}_{i=1}^{I_c})=\prod _{i=1}^{I_c} \prod _{n=1}^{N^c_i}e^{h(\omega ^c_{i,n},y^c_{i,n}g^c_{i,n})}p_{\text {PG}}(\omega ^c_{i,n}\mid 1,0), \end{aligned} \end{aligned}$$
(3)

where \(\omega ^c_{i,n}\) is the Pólya-Gamma latent variable on the n-th observed sample in the i-th classification task, \(\varvec{\omega }^c_i=[\omega ^c_{i,1},\ldots ,\omega ^c_{i,N_i^c}]^\top\), \(\varvec{\omega }^c=[{\varvec{\omega }^c_1}^\top ,\ldots ,{\varvec{\omega }^c_{I_c}}^\top ]^\top\). The derivation is provided in Section A. The augmented classification likelihood in Eq. (3) is conditionally conjugate to the MOGP prior.

4.2 Augmentation for Cox process tasks

The augmentation for Cox process is more challenging than classification because the Cox process likelihood depends not only on the latent function values on observed samples but also on the whole latent function due to the exponential integral term. Borrowing the idea from Donner and Opper (2018), in addition to augmenting Pólya-Gamma latent variables on observed samples as in classification tasks, we also augment a marked Poisson latent process to linearize the exponential integral term.

Define a marked Poisson process \(\Pi =\{(\textbf{x}_r,\omega _r)\}\sim p(\Pi \mid \bar{\lambda }p_{\text {PG}}(\omega \mid 1,0))\) where \(\textbf{x}_r\) is the location of r-th point, the Pólya-Gamma latent variable \(\omega _r\) denotes the independent mark at each point \(\textbf{x}_r\), \(p(\Pi \mid \bar{\lambda }p_{\text {PG}}(\omega \mid 1,0))\) denotes the probability measure of \(\Pi\) with intensity \(\Lambda (\textbf{x},\omega )=\bar{\lambda }p_{\text {PG}}(\omega \mid 1,0)\). Given the marked Poisson process defined above, the following identity holds:

$$\begin{aligned} \exp {\left( -\int _\mathcal {X}\bar{\lambda }s(g(\textbf{x}))d\textbf{x}\right) }=\mathbb {E}_{p_\Lambda }\prod _{(\omega ,\textbf{x})\in \Pi }e^{h(\omega ,-g(\textbf{x}))}, \end{aligned}$$
(4)

where \(p_\Lambda\) indicates \(p(\Pi \mid \Lambda (\textbf{x},\omega )=\bar{\lambda }p_{\text {PG}}(\omega \mid 1,0))\). Substituting Eqs. (2) and 4 into the Cox process likelihood in Eq. (1c), we obtain the augmented Cox process likelihood which has the conditionally conjugate property. After augmenting the Pólya-Gamma latent variables on observed samples and the marked Poisson latent process, the Cox process likelihood in Eq. (1c) is augmented to be:

$$\begin{aligned} \begin{aligned}&p(\textbf{x}^p,\varvec{\omega }^p,\Pi \mid \bar{\varvec{\lambda }}, \{g^p_i\}_{i=1}^{I_p})=\\&\prod _{i=1}^{I_p}\prod _{n=1}^{N_i^p}\Lambda _{i}(\textbf{x}^p_{i,n},\omega ^p_{i,n})e^{h(\omega ^p_{i,n},g^p_{i,n})}p_{\Lambda _i}(\Pi _i\mid \bar{\lambda }_i)\prod _{(\omega ,\textbf{x})\in \Pi _i}e^{h(\omega ,-g^p_i(\textbf{x}))}, \end{aligned} \end{aligned}$$
(5)

where \(\omega ^p_{i,n}\) is the Pólya-Gamma latent variable on n-th observed sample in the i-th Cox process task, \(\varvec{\omega }^p_i=[\omega ^p_{i,1},\ldots ,\omega ^p_{i,N_i^p}]^\top\), \(\varvec{\omega }^p=[{\varvec{\omega }^p_1}^\top ,\ldots ,{\varvec{\omega }^p_{I_p}}^\top ]^\top\), \(\Lambda _{i}(\textbf{x},\omega )=\bar{\lambda }_ip_{\text {PG}}(\omega \mid 1,0)\), \(\Pi =\{\Pi _i\}_{i=1}^{I_p}\). The derivation is provided in Section B. The augmented Cox process likelihood in Eq. (5) is conditionally conjugate to the MOGP prior.

4.3 Mean-field approximation

Based on the augmented likelihoods for classification and Cox process in Eqs. (3) and (5), we obtain the augmented joint distribution for all variables:

$$\begin{aligned} \begin{aligned}&p(\textbf{y}^{r},\textbf{y}^{c},\textbf{x}^p,\varvec{\omega }^c,\varvec{\omega }^p,\Pi ,g,\bar{\varvec{\lambda }})=\\&\underbrace{p(\textbf{y}^{r}\mid \{g_{i}^r\}_{i=1}^{I_r})}_{\text {regression}}\underbrace{p(\textbf{y}^c,\varvec{\omega }^c\mid \{g^c_i\}_{i=1}^{I_c})}_{\text {augmented classification}}\underbrace{p(\textbf{x}^p,\varvec{\omega }^p,\Pi \mid \bar{\varvec{\lambda }}, \{g^p_i\}_{i=1}^{I_p})}_{\text {augmented Cox process}}\underbrace{p(g)}_{\text {MOGP}}p(\bar{\varvec{\lambda }}). \end{aligned} \end{aligned}$$
(6)

Finally, our efforts are rewarded: after data augmentation, the model likelihood is conditionally conjugate to the prior and a simple Gibbs sampler can be derived to sample from the exact posterior by drawing a sample from each conditional distribution alternately. The samples of latent functions and intensity upper-bounds will be from the true posterior asymptotically. However, the sampling approach has a prohibitive computational cost and does not scale to large datasets. The comparison of efficiency between Gibbs sampler and variational inference is outside of the scope of this paper. Here we adopt the augmented model to derive an efficient mean-field approximation, which has closed-form iterative updates.

Following the standard derivation of mean-field approximation, we assume the posterior \(p(\varvec{\omega }^c,\varvec{\omega }^p,\Pi ,g,\bar{\varvec{\lambda }}\mid \textbf{y}^{r},\textbf{y}^{c},\textbf{x}^p)\) is approximated by a variational posterior:

$$\begin{aligned} q(\varvec{\omega }^c,\varvec{\omega }^p,\Pi ,g,\bar{\varvec{\lambda }})=q_1(\varvec{\omega }^c,\varvec{\omega }^p,\Pi )q_2(g,\bar{\varvec{\lambda }}). \end{aligned}$$

The independence of two sets of variables is the only assumption of the variational posterior. To minimize the Kullback–Leibler (KL) divergence between q and p, it can be proved that the optimal distribution of each factor is the expectation of the logarithm of Eq. (6) taken over variables in the other factor (Bishop, 2006; Blei et al., 2017):

$$\begin{aligned} \begin{aligned} q_1^*(\varvec{\omega }^c,\varvec{\omega }^p,\Pi )&\propto e^{{\mathbb {E}_{q_2}[\log p(\textbf{y}^{r},\textbf{y}^{c},\textbf{x}^p,\varvec{\omega }^c,\varvec{\omega }^p,\Pi ,g,\bar{\varvec{\lambda }})]}},\\ q_2^*(g,\bar{\varvec{\lambda }})&\propto e^{{\mathbb {E}_{q_1}[\log p(\textbf{y}^{r},\textbf{y}^{c},\textbf{x}^p,\varvec{\omega }^c,\varvec{\omega }^p,\Pi ,g,\bar{\varvec{\lambda }})]}}. \end{aligned} \end{aligned}$$
(7)

A prominent weakness of GP is that it suffers from a cubic complexity w.r.t. the number of samples. In multi-task scenario, although the samples in a single task can be few, the total number of samples in all tasks can be large. To make our mean-field approximation scalable, we employ the inducing points formalism (Alvarez and Lawrence, 2008; Titsias, 2009). We denote M inducing inputs \([\textbf{x}_1\,\ldots ,\textbf{x}_M]^\top\) on the domain \(\mathcal {X}\) for each task. The function values of basis function \(f_q\) at these inducing inputs are defined as \(\textbf{f}_{q,\textbf{x}_m}\). Then we can obtain the i-th task latent function \(g_i\) at these inducing inputs \(\textbf{g}_{\textbf{x}_m}^i=\sum _{q=1}^Qw_{i,q}\textbf{f}_{q,\textbf{x}_m}\).Footnote 3 If we define \(\textbf{g}_{\textbf{x}_m}=[\textbf{g}^{1\top }_{\textbf{x}_m},\ldots ,\textbf{g}^{I\top }_{\textbf{x}_m}]^\top\), \(\textbf{g}_{\textbf{x}_m}\sim \mathcal {N}(\textbf{0},\textbf{K}_{\textbf{x}_m \textbf{x}_m})\) where \(\textbf{K}_{\textbf{x}_m\textbf{x}_m}\) is the MOGP covariance on \(\textbf{x}_m\) for all tasks and \(\textbf{g}_{\textbf{x}_m}^i\sim \mathcal {N}(\textbf{0},\textbf{K}_{\textbf{x}_m \textbf{x}_m}^i)\) where \(\textbf{K}_{\textbf{x}_m\textbf{x}_m}^i\) is i-th diagonal block of \(\textbf{K}_{\textbf{x}_m\textbf{x}_m}\). Given \(\textbf{g}_{\textbf{x}_m}^i\), we assume \(p(g_i(\textbf{x})\mid \textbf{g}_{\textbf{x}_m}^i)=\mathcal {N}(\textbf{k}_{\textbf{x}_m \textbf{x}}^{i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{i^{-1}}\textbf{g}_{\textbf{x}_m}^i, k_{\textbf{x}\textbf{x}}^{i}-\textbf{k}_{\textbf{x}_m \textbf{x}}^{i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{i^{-1}}\textbf{k}_{\textbf{x}_m \textbf{x}}^{i})\) where \(\textbf{k}_{\textbf{x}_m \textbf{x}}^{i}\) is the kernel w.r.t. inducing points and the predictive point, \(k_{\textbf{x}\textbf{x}}^{i}\) is the kernel w.r.t. the predictive point for i-th task.

After substituting Eq. (6) into Eq. (7) and introducing the inducing points, we can obtain the optimal variational distribution of each factor in the following closed-form expressions (derivation provided in Section C).

4.4 The optimal density of Pólya-Gamma latent variables

The optimal variational posteriors of \(\varvec{\omega }^c\) and \(\varvec{\omega }^p\) are:

$$\begin{aligned} q_1(\varvec{\omega }^c)= & {} \prod _{i=1}^{I_c}\prod _{n=1}^{N_i^c}p_{\text {PG}}(\omega ^c_{i,n}\mid 1,\tilde{g}^c_{i,n}), \end{aligned}$$
(8)
$$\begin{aligned} q_1(\varvec{\omega }^p)= & {} \prod _{i=1}^{I_p}\prod _{n=1}^{N_i^p}p_{\text {PG}}(\omega ^p_{i,n}\mid 1,\tilde{g}^p_{i,n}), \end{aligned}$$
(9)

where \(\tilde{g}^\cdot _{i,n}=\sqrt{\mathbb {E}[{g^\cdot _{i,n}}^2]}\).

4.5 The optimal intensity of marked Poisson processes

The optimal variational posterior intensity of \(\Pi =\{\Pi _i\}_{i=1}^{I_p}\) is:

$$\begin{aligned} \begin{aligned} \Lambda _i^1(\textbf{x},\omega )=\bar{\lambda }_i^1 s(-\tilde{g}^p_i(\textbf{x}))p_{\text {PG}}(\omega \mid 1,\tilde{g}^p_i(\textbf{x}))e^{(\tilde{g}^p_i(\textbf{x})-\bar{g}^p_i(\textbf{x}))/2}, \end{aligned} \end{aligned}$$
(10)

where \(\bar{\lambda }^1_i=e^{\mathbb {E}[\log \bar{\lambda }_i]}\), \(\tilde{g}_i^p(\textbf{x})=\sqrt{\mathbb {E}[{g_i^p(\textbf{x})}^2]}\) and \(\bar{g}_i^p(\textbf{x})=\mathbb {E}[g_i^p(\textbf{x})]\).

4.6 The optimal density of intensity upper-bounds

The optimal variational posterior of \(\bar{\varvec{\lambda }}\) is:

$$\begin{aligned} \begin{aligned} q_2(\bar{\varvec{\lambda }})=\prod _{i=1}^{I_p}p_{\text {Ga}}(\bar{\lambda }_i\mid N_i^p+R_i,|\mathcal {X}|), \end{aligned} \end{aligned}$$
(11)

where \(p_{\text {Ga}}\) is Gamma density, \(R_i=\int _\mathcal {X}\int _0^\infty \Lambda _i^1(\textbf{x},\omega )d\omega d\textbf{x}\), \(|\mathcal {X}|\) is the domain size.

4.7 The optimal density of latent functions

The optimal variational posterior of \(\textbf{g}_{\textbf{x}_m}\) is:

$$\begin{aligned} q_2(\textbf{g}_{\textbf{x}_m})=\mathcal {N}(\textbf{g}_{\textbf{x}_m}\mid \textbf{m}_{\textbf{x}_m},\mathbf {\Sigma }_{\textbf{x}_m}), \end{aligned}$$
(12)

where \(\textbf{g}_{\textbf{x}_m}=[\textbf{g}_{\textbf{x}_m}^{r\top },\textbf{g}_{\textbf{x}_m}^{c\top },\textbf{g}_{\textbf{x}_m}^{p\top }]^\top\), \(\textbf{g}_{\textbf{x}_m}^{\cdot }=[\textbf{g}_{1,\textbf{x}_m}^{\cdot \top },\ldots ,\textbf{g}_{I_\cdot ,\textbf{x}_m}^{\cdot \top }]^\top\) and

$$\begin{aligned} \mathbf {\Sigma }_{\textbf{x}_m}=\left[ \text {diag}\left( \textbf{H}_{\textbf{x}_m}^r,\textbf{H}_{\textbf{x}_m}^c,\textbf{H}_{\textbf{x}_m}^p\right) +\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{-1}\right] ^{-1},\textbf{m}_{\textbf{x}_m}=\mathbf {\Sigma }_{\textbf{x}_m}[\textbf{v}_{\textbf{x}_m}^{r\top },\textbf{v}_{\textbf{x}_m}^{c\top },\textbf{v}_{\textbf{x}_m}^{p\top }]^\top , \end{aligned}$$

where \(\textbf{H}_{\textbf{x}_m}^\cdot =\text {diag}(\textbf{H}_{1,\textbf{x}_m}^\cdot ,\ldots ,\textbf{H}_{I_\cdot ,\textbf{x}_m}^\cdot )\), \(\textbf{v}_{\textbf{x}_m}^\cdot =[\textbf{v}_{1,\textbf{x}_m}^{\cdot \top },\ldots ,\textbf{v}_{I_\cdot ,\textbf{x}_m}^{\cdot \top }]^\top\) and

$$\begin{aligned} \begin{aligned} \textbf{H}_{i,\textbf{x}_m}^r=\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{r,i^{-1}}\textbf{K}_{\textbf{x}_m \textbf{x}_n}^{r,i}\textbf{D}^r_i\textbf{K}_{\textbf{x}_m \textbf{x}_n}^{r,i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{r,i^{-1}}, \textbf{v}_{i,\textbf{x}_m}^{r}=\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{r,i^{-1}}\textbf{K}_{\textbf{x}_m \textbf{x}_n}^{r,i}\frac{\textbf{y}^r_i}{\sigma _i^2},\\ \textbf{H}_{i,\textbf{x}_m}^c=\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{c,i^{-1}}\textbf{K}_{\textbf{x}_m \textbf{x}_n}^{c,i}\textbf{D}^c_i\textbf{K}_{\textbf{x}_m \textbf{x}_n}^{c,i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{c,i^{-1}}, \textbf{v}_{i,\textbf{x}_m}^{c}=\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{c,i^{-1}}\textbf{K}_{\textbf{x}_m \textbf{x}_n}^{c,i}\frac{\textbf{y}^c_i}{2},\\ \textbf{H}_{i,\textbf{x}_m}^p=\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{p,i^{-1}}\int _{\mathcal {X}}A_i(\textbf{x})\textbf{k}_{\textbf{x}_m \textbf{x}}^{p,i}\textbf{k}_{\textbf{x}_m \textbf{x}}^{p,i\top }d\textbf{x}\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{p,i^{-1}},\\ \textbf{v}_{i,\textbf{x}_m}^{p}=\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{p,i^{-1}}\int _{\mathcal {X}}B_i(\textbf{x})\textbf{k}_{\textbf{x}_m \textbf{x}}^{p,i}d\textbf{x}, \end{aligned} \end{aligned}$$

where \(\textbf{D}^r_i=\text {diag}(1/\sigma _i^2)\), \(\textbf{D}^c_i=\text {diag}(\mathbb {E}[\varvec{\omega }^{c}_i])\) and

$$\begin{aligned} \begin{aligned} A_i(\textbf{x})=\sum _{n=1}^{N^p_i}\mathbb {E}[\omega _{i,n}^p]\delta (\textbf{x}-\textbf{x}^p_{i,n}) +\int _0^\infty \omega \Lambda _i^1(\textbf{x},\omega )d\omega ,\\ B_i(\textbf{x})=\frac{1}{2}\sum _{n=1}^{N^p_i}\delta (\textbf{x}-\textbf{x}^p_{i,n})-\frac{1}{2} \int _0^\infty \Lambda _i^1(\textbf{x},\omega )d\omega . \end{aligned} \end{aligned}$$

4.8 Predictive distribution

The posterior distribution of the task-specific latent function \(g_i\) at a predictive point \(\textbf{x}\) is approximated by

$$\begin{aligned} q(g_i(\textbf{x}))=\int p(g_i(\textbf{x})\mid \textbf{g}_{\textbf{x}_m}^i)q(\textbf{g}_{\textbf{x}_m}^i)d\textbf{g}_{\textbf{x}_m}^i=\mathcal {N}(g_i(\textbf{x})\mid \mu ,\sigma ^2), \end{aligned}$$

where \(\mu =\textbf{k}_{\textbf{x}_m \textbf{x}}^{i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{i^{-1}}\textbf{m}_{\textbf{x}_m}^i\), \(\sigma ^2=k_{\textbf{x}\textbf{x}}^{i}-\textbf{k}_{\textbf{x}_m \textbf{x}}^{i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{i^{-1}}\textbf{k}_{\textbf{x}_m \textbf{x}}^{i}+\textbf{k}_{\textbf{x}_m \textbf{x}}^{i\top }\textbf{K}_{\textbf{x}_m \textbf{x}_m}^{i^{-1}}\mathbf {\Sigma }_{\textbf{x}_m}^i\textbf{K}_{\textbf{x}_m\textbf{x}_m}^{i^{-1}}\textbf{k}_{\textbf{x}_m \textbf{x}}^{i}\). Therefore, \(\tilde{g}_i(\textbf{x})=\sqrt{\mu ^2+\sigma ^2}\), \(\bar{g}_i(\textbf{x})=\mu\), \(\mathbb {E}[\omega ]=\frac{b}{2c}\tanh \frac{c}{2}\) for \(p_{\text {PG}}(\omega \mid b,c)\) (Polson et al., 2013), \(\mathbb {E}[\log \bar{\lambda }_i]=\psi (N_i^p+R_i)-\log (|\mathcal {X}|)\) where \(\psi (\cdot )\) is digamma function. The intractable integral over \(\mathcal {X}\) is solved by numerical quadrature. Updating the variational posterior of each factor alternately by Eqs. (8), (9), (10), (11) and (12), we obtain approximate posteriors of \(\bar{\varvec{\lambda }}\) and \(\textbf{g}_{\textbf{x}_m}\).

4.9 Hyperparameters and computation complexity

The model hyperparameter \(\varvec{\Theta }\) comprises the kernel hyperparameters \(\{\varvec{\theta }_q\}_{q=1}^Q\) associated to the covariance functions \(\{k_q\}_{q=1}^Q\), the mixing weights \(\{\textbf{w}_q\}_{q=1}^Q\), the inducing inputs \(\{\textbf{x}_m\}_{m=1}^M\) and the noise variance \(\{\sigma _i^2\}_{i=1}^{I_r}\) in regression tasks. In this work, the inducing points are uniformly located on the domain, which means the kernel matrix has Toeplitz structure (Cunningham et al., 2008) and this can lead to more efficient matrix inversion. In the implementation, we do not apply this method and instead use the naive matrix inversion. \(\{\varvec{\theta }_q\}_{q=1}^Q\), \(\{\textbf{w}_q\}_{q=1}^Q\) and \(\{\sigma _i^2\}_{i=1}^{I_r}\) are optimized by maximizing the marginal likelihood, which is also called the empirical Bayes. Due to the intractability of marginal likelihood, we adopt an approximate approach: maximize the ELBO as a function of hyperparameters by alternating between updating variational parameters and hyperparameters. In the following, we derive the ELBO:

$$\begin{aligned} \begin{aligned}&\log p(\textbf{y}^{r},\textbf{y}^{c},\textbf{x}^p)\ge \\&\mathbb {E}_q[\log p(\textbf{y}^{r},\textbf{y}^{c},\textbf{x}^p\mid \varvec{\omega }^c,\varvec{\omega }^p,\Pi ,g,\bar{\varvec{\lambda }})]-\text {KL}(q(\varvec{\omega }^c,\varvec{\omega }^p,\Pi ,g,\bar{\varvec{\lambda }})\Vert p(\varvec{\omega }^c,\varvec{\omega }^p,\Pi ,g,\bar{\varvec{\lambda }}))\\&=\mathbb {E}_q[\log p(\textbf{y}^{r}\mid \{g_{i}^r\}_{i=1}^{I_r})]+\mathbb {E}_q[\log p(\textbf{y}^{c}\mid \varvec{\omega }^c,\{g_{i}^c\}_{i=1}^{I_c})]\\&+\mathbb {E}_q[\log p(\textbf{x}^p\mid \varvec{\omega }^p,\Pi ,\{g_{i}^p\}_{i=1}^{I_p},\bar{\varvec{\lambda }})]-\text {KL}(q(g)\Vert p(g))\\&-\text {KL}(q(\varvec{\omega }^c,\varvec{\omega }^p,\Pi ,\bar{\varvec{\lambda }})\Vert p(\varvec{\omega }^c,\varvec{\omega }^p,\Pi ,\bar{\varvec{\lambda }})), \end{aligned} \end{aligned}$$

where we omit the conditioning on hyperparameters. It is straightforward to see that, given variational posteriors, only the first term includes the noise variance \(\{\sigma _i^2\}_{i=1}^{I_r}\) and only the fourth term includes the kernel hyperparameters \(\{\varvec{\theta }_q\}_{q=1}^Q\) and the mixing weights \(\{\textbf{w}_q\}_{q=1}^Q\). All other terms are constant w.r.t. hyperparameters. After introducing the inducing points on g, we obtain the inducing points version:

$$\begin{aligned}&\mathbb {E}_q[\log p(\textbf{y}^{r}\mid \{g_{i}^r\}_{i=1}^{I_r})]=\sum _{i=1}^{I_r}\sum _{n=1}^{N_i^r}-\log (\sigma _i\sqrt{2\pi })-\frac{1}{2\sigma _i^2}({y_{i,n}^{r^2}}-2y_{i,n}^r\bar{g}_{i,n}^r+{\tilde{g}_{i,n}^{r^2}}) \end{aligned}$$
(13a)
$$\begin{aligned}&\text {KL}(q(\textbf{g}_{\textbf{x}_m})\Vert p(\textbf{g}_{\textbf{x}_m}))\nonumber \\&\quad =\frac{1}{2}\left( \log |\textbf{K}_{\textbf{x}_m\textbf{x}_m}|-\log |\mathbf {\Sigma }_{\textbf{x}_m}|-M\cdot I+\text {Tr}[\textbf{K}_{\textbf{x}_m\textbf{x}_m}^{-1}\mathbf {\Sigma }_{\textbf{x}_m}]+\textbf{m}_{\textbf{x}_m}^{\top }\textbf{K}_{\textbf{x}_m\textbf{x}_m}^{-1}\textbf{m}_{\textbf{x}_m}\right) , \end{aligned}$$
(13b)

where we assume \(p(\textbf{g}_{\textbf{x}_m})=\mathcal {N}(\textbf{g}_{\textbf{x}_m}\mid \textbf{0},\textbf{K}_{\textbf{x}_m\textbf{x}_m})\). Maximizing Eq. (13a), we obtain the optimal noise variance:

$$\begin{aligned} \sigma _i^{2*}=\left( \sum _{n=1}^{N_i^r}{y_{i,n}^{r^2}}-2y_{i,n}^r\bar{g}_{i,n}^r+{\tilde{g}_{i,n}^{r^2}}\right) /N_i^r. \end{aligned}$$
(14)

Minimizing Eq. 13b, we obtain the optimal kernel hyperparameters \(\{\varvec{\theta }_q\}_{q=1}^Q\) and mixing weights \(\{\textbf{w}_q\}_{q=1}^Q\), which has no closed-form solution and we resort to the automatic differentiation technique. The pseudocode of mean-field approximation is provided in Algorithm 1.

figure a

Defining S as the number of quadrature nodes on all point process tasks, the computational complexity of our mean-field approximation is dominated by the matrix inversion \(O(M^3I^3)\) and product \(O(M^2(N^r+N^c+N^p+S))\) where \(N^{\cdot }\) is the number of samples in the corresponding tasks.

4.10 Convergence and minibatch

The theoretical analysis in Hoffman et al. (2013) shows that performing the mean-field iteration for a conditionally conjugate model is equivalent to updating parameters by the natural gradient descent (Amari, 1998) with a step size of one. Therefore, our proposed mean-field approximation has inherently a faster convergence than the standard gradient descent.

The mean-field algorithm above uses all data. For further acceleration, we can resort to the stochastic variational inference (Hoffman et al., 2013) by subsampling the tasks, and samples in regression and classification tasks.

5 Experiments

In this section, we analyze our model and inference on synthetic and real-world datasets to demonstrate the performance in terms of transfer capability, efficiency and convergence. For all experiments, we use the RBF kernel \(k(\textbf{x},\textbf{x}')=\theta _0\exp {(-\frac{\theta _1}{2}\Vert \textbf{x}-\textbf{x}'\Vert ^2)}\) as covariance functions, and the usage of other kernels is outside of the scope of this paper. The implementation code is publicly available at https://github.com/zhoufeng6288/HGCox.

Baselines To show the superiority of our approach, we compare our model HMGCP against the single-task Cox process model: variational LGCP (Nguyen & Bonilla, 2014), and the multi-task models: MLGCP (Taylor et al., 2015) and MCPM (Aglietti et al., 2019).

Metrics We provide the comparison result of our model with baselines in terms of estimation error (EE), test log-likelihood (TLL), running time (RT) and convergence rate (CR). EE is the root mean square error (RMSE) between the estimated parameter and the ground truth. It is worth noting that EE is only applicable to synthetic data because the ground truth is required. TLL is the log-likelihood on test data using the posterior mean of parameters estimated from training data. RT is the running time of the inference algorithm. CR is the convergence rate of training log-likelihood w.r.t. the number of iterations.

5.1 Synthetic data: complete

To illustrate the performance of transfer capability, efficiency and convergence of our approach, we simulate three heterogeneous correlated tasks (one regression, one binary classification and one Cox process) by sampling three latent functions from a MOGP prior and using them to simulate the observed samples in regression, classification and Cox process tasks. We simulate three sets of synthetic data using three different sets of hyperparameters where latent functions vary from gently to drastically; each synthetic dataset contains both training and test data. We use two basis functions. The hyperparameters are \(\sigma ^2=0.1\), \(\varvec{\theta }_1=[1,0.001]\), \(\varvec{\theta }_2=[1,0.001]\), \(\textbf{w}_1=[0.9,0.5,0.1]\) and \(\textbf{w}_2=[0.1,0.5,0.9]\) for the first dataset; \(\sigma ^2=0.1\), \(\varvec{\theta }_1=[1,0.02]\), \(\varvec{\theta }_2=[2,0.001]\), \(\textbf{w}_1=[0.9,0.5,0.1]\) and \(\textbf{w}_2=[0.1,0.5,0.9]\) for the second dataset; \(\sigma ^2=0.1\), \(\varvec{\theta }_1=[1,0.1]\), \(\varvec{\theta }_2=[2,0.1]\), \(\textbf{w}_1=[0.9,0.5,0.1]\) and \(\textbf{w}_2=[0.1,0.5,0.9]\) for the third dataset.

For each dataset, we draw two basis functions \(\{f_q\}_{q=1}^2\) on the domain [0, 100] from two independent zero-mean GP priors with the corresponding kernel hyperparameters. The task-specific latent functions are \(\{g_i=\sum _{q=1}^2w_{i,q}f_q\}_{i=1}^3\). \(g_1\) is used as the mean of a Gaussian distribution \(\mathcal {N}(g_1(\textbf{x}),\sigma ^2)\) to draw samples for the regression task. \(g_2\) is passed through a sigmoid function and then used as the parameter of a Bernoulli distribution to draw samples for the binary classification task. \(g_3\) is passed through a sigmoid function and then scaled by \(\bar{\lambda }\) to serve as the intensity for simulating a Cox process. For regression and classification tasks, we assume the samples are uniformly distributed on the domain.

Our goal is to recover the intensity upper-bound \(\bar{\lambda }\) and latent functions \(\{g_i\}_{i=1}^3\). We use 30 inducing points that are uniformly distributed on the domain and 100 Gaussian quadrature nodes for the intractable integral. For initialization, the initial hyperparameters \(\sigma ^2\), \(\{\varvec{\theta }_q,\textbf{w}_q\}_{q=1}^2\) are set to the ground-truth hyperparameters and the variational parameters are initialized randomly. In the training process, the variational parameters and hyperparameters are updated concurrently. Specifically, the variational parameters are updated by the mean-filed iteration, the kernel hyperparameters \(\{\varvec{\theta }_q,\textbf{w}_q\}_{q=1}^2\) are updated by minimizing Eq. (13b) using the ‘SLSQP’ method, and the noise variance \(\sigma ^2\) is updated by Eq. (14). Figure 1 represents the estimated result for three datasets where we can see HMGCP is able to recover the ground truth. For convergence, HMGCP only takes 2-3 steps to converge in terms of training log-likelihood, which is much faster than the first-order gradient-based LGCP requiring more than 500 steps. More importantly, HMGCP has the better EE and TLL (Table 1) than the single-task LGCP that is trained independently and not able to transfer information to help recover the intensity of Cox process. For a fair comparison of efficiency, we run both HMGCP and LGCP on a single Cox process task with 400 iterations, and our inference is at least twice as fast as LGCP (Table 1) demonstrating its outstanding efficiency.

Fig. 1
figure 1

HMGCP recovers the latent functions \(g_1\), \(s(g_2)\) and \(\bar{\lambda }s(g_3)\) in three datasets whose posterior is constructed by 100 samples of g and \(\bar{\lambda }\) from the corresponding variational posterior. The shading area indicates one standard deviation. Blue dots are samples in regression task; red circles and blue crosses are positive and negative samples in classification task; blue bars are samples in Cox process task. For LGCP, we show the posterior mean intensity for the Cox process task

Table 1 The performance of EE, TLL and RT for HMGCP and LGCP on three synthetic datasets.

5.2 Synthetic data: missing

As far as we know, all current multi-task Cox process models exclusively focus on homogeneous scenarios. This does not apply to the more general heterogeneous multi-task setup where we need to transfer knowledge between multiple heterogeneous correlated tasks. In this section, we compare HMGCP against homogeneous multi-task baselines: MLGCP and MCPM. We construct four heterogeneous correlated tasks (one regression, one binary classification and two Cox processes) using the same method as in Sect. 5.1. We simulate one set of synthetic data that contains both training and test data. We use two basis functions. The hyperparameters are \(\sigma ^2=0.1\), \(\varvec{\theta }_1=[1,0.02]\), \(\varvec{\theta }_2=[2,0.001]\), \(\textbf{w}_1=[0.9,0.1,0.3,1.0]\) and \(\textbf{w}_2=[0.1,0.9,0.5,1.0]\). To further illustrate the heterogeneous transfer capability of our approach, in addition to the complete data, we follow the experimental setup of Aglietti et al. (2019): we create some missing gaps by evenly partitioning the domain into several regions and randomly masking four non-overlapping regions on four tasks (one for each task). To demonstrate the transfer capability on problems with different levels of difficulty, we experiment with two missing-gap widths: 5 and 10, where a wider missing gap means a more difficult transfer problem. For each missing-gap width, we experiment with ten random configurations of missing gaps.

We use 10 inducing points which are uniformly distributed on the domain. All the other experimental settings are the same as in Sect. 5.1. HMGCP successfully transfers knowledge between heterogeneous tasks by exploiting commonalities between them to recover the missing-gap latent functions for all tasks (Fig. 2), whereas MLGCP and MCPM exhibit the inferior generalization capability since they can only share information between Cox processes. Figure 2 shows the estimated latent functions for several configurations with 3 different missing-gap widths across tasks. Generally, the transfer of knowledge in regression and classification tasks is easier than that in Cox process tasks. This is because the likelihood of regression and classification only considers observed points, the function in the missing gap is entirely determined by the smoothness induced by prior. However, in addition to observed points, the Cox process likelihood also considers the domain where no points appear, so the function in the missing gap is determined by both prior and likelihood (zero-valued intensity). This makes the estimated intensity in the missing gap generally lower than the ground truth. For each missing-gap width, we report the statistics of EE and TLL for HMGCP, MLGCP and MCPM over ten random configurations of missing gaps in Table 2 where HMGCP outperforms alternatives in all experiments. The reason is HMGCP extracts useful information from regression, classification and other Cox processes to improve the estimation of intensity for the current Cox process, while MLGCP and MCPM cannot incorporate the information existing in heterogeneous tasks. As in Sect. 5.1, we run HMGCP, MLGCP and MCPM only on the complete Cox process data for a fair comparison of efficiency: HMGCP consumes 3.68 seconds, while MLGCP and MCPM consume 12.15 and 21.36 seconds, respectively (2000 iterations).

Table 2 The performance of EE and TLL for HMGCP, MLGCP and MCPM over ten random configurations of missing gaps with three different missing-gap widths (0 means complete data).
Fig. 2
figure 2

The estimated posterior of latent functions \(g_1\), \(s(g_2)\), \(\bar{\lambda }_3s(g_3)\) and \(\bar{\lambda }_4s(g_4)\) from HMGCP with missing-gap width being (a) 0, b 5 and c 10. For missing-gap widths 5 and 10, we show two configurations of missing gaps across tasks. The grey areas indicate the masked missing gaps. For MLGCP and MCPM, we show the posterior mean intensities for two Cox process tasks. The posterior variance in the missing gap does not increase significantly meaning HMGCP successfully transfers heterogeneous knowledge

5.3 Real data

In this section, we demonstrate the superiority of HMGCP in terms of heterogeneous knowledge transfer, efficiency and convergence on a real-world 2D urban data of Vancouver. The datasetFootnote 4 contains four parts of data (Fig. 3): (1) Employment income in Vancouver: the median employment income for full-year full-time workers in 2015 in the neighbourhoods of Vancouver; (2) Education in Vancouver: the number of population holding university certificate, diploma or degree at bachelor level or above in the neighbourhoods of Vancouver; (3) Crime in Vancouver: the recording of miscellaneous crimes (type, neighbourhood, latitude, longitude) in 2015 in Vancouver; (4) Non-market housing in Vancouver: the information of non-market housing projects (name, address, neighbourhood, latitude, longitude) that is for low and moderate income singles and families.

Fig. 3
figure 3

The median employment income (top left), education degree (top right), theft of vehicle (bottom left) and non-market house (bottom right) in 22 neighbourhoods of Vancouver

For the first dataset, we formulate it as a regression task, and use the centroid of each neighbourhood as the input, the median income as the output; for the second dataset, we formulate it as a binary classification task according to the degree of education: we divide the 22 neighbourhoods into ‘\(+1\)’ if there are more people holding university certificate, diploma or degree at bachelor level or above, and ‘\(-1\)’ if not; for the third and fourth datasets, we extract the locations of ‘Theft of Vehicle’ records in 2015 and non-market housing projects respectively, and formulate them as two Cox process tasks. On the basis of common sense, the income level, education degree, crime rate and non-market housing are closely correlated. Therefore, their integrative analysis offers more advantages compared to learning multiple tasks independently, which is susceptible to overfitting.

Fig. 4
figure 4

The estimated posterior mean latent functions \(g_1\), \(s(g_2)\), \(\bar{\lambda }_3s(g_3)\) and \(\bar{\lambda }_4s(g_4)\) from HMGCP with the mask size being (a) \(5\times 5\), (b) \(10\times 10\) and (c) \(20\times 20\) on each Cox process task. We show one configuration of masked regions across Cox process tasks. The black boxes indicate the masked regions

Fig. 5
figure 5

The estimated posterior mean intensity functions for two Cox process tasks from MLGCP and MCPM with the mask size being (a) \(5\times 5\), (b) \(10\times 10\) and (c) \(20\times 20\) on each Cox process task. We show one configuration of masked regions across Cox process tasks. The black boxes indicate the masked regions

To show the heterogeneous transfer capability of our approach, we compare HMGCP against MLGCP and MCPM. Due to lack of ground-truth latent functions, we cannot compare them in terms of EE but only TLL. We scale the area of Vancouver between longitude \([-123.226,-123.022]\) and latitude [49.20, 49.30] to the domain \([0,100]\times [0,50]\). We choose three basis functions by trial and error: we gradually increase the number of basis functions and find that using three basis functions can achieve excellent performance. Using more basis functions only has a slight impact on the performance on the test data, but leads to longer training time. The initial hyperparameters are set to \(\sigma ^2=0.1\), \(\varvec{\theta }_1=[1,0.01]\), \(\varvec{\theta }_2=[1,0.005]\), \(\varvec{\theta }_3=[1,0.001]\), \(\textbf{w}_1=[0.5,0.5,0.1,0.1]\), \(\textbf{w}_2=[0.1,0.5,0.2,0.5]\) and \(\textbf{w}_3=[0.5,0.1,0.5,0.2]\), and the variational parameters are initialized randomly. In the training process, the variational parameters and hyperparameters are updated concurrently. Specifically, the variational parameters are updated by the mean-filed iteration, the kernel hyperparameters \(\{\varvec{\theta }_q,\textbf{w}_q\}_{q=1}^3\) are updated by minimizing Eq. (13b) using the ‘SLSQP’ method, and the noise variance \(\sigma ^2\) is updated by Eq. (14). To assess the transfer capability with different levels of difficulty, we follow the experimental setup in Sect. 5.2: we randomly mask two non-overlapping regions on Crime in Vancouver and Non-market housing in Vancouver, one for each task, with three different mask sizes: \(5\times 5\), \(10\times 10\) and \(20\times 20\). A larger mask indicates a more difficult transfer problem. For each mask size, we experiment with ten random configurations of masks.

We use \(10\times 5\) uniformly distributed inducing points horizontally and vertically on each task and \(50\times 25\) Gaussian quadrature nodes for the intractable integral. We randomly mask regions as explained above, and use the remaining data for training and the masked data for testing. Figure 4 shows several examples of estimated latent functions from HMGCP with 3 different mask sizes (two examples for each size), while Fig. 5 shows the corresponding estimated intensity functions from MLGCP and MCPM. The black boxes in Fig. 4 represent several possible configuration of masked regions on two Cox process tasks. It is easily observed in the data that in terms of income level and education degree, the west is significantly higher than the east; while for crime rate and non-market housing, it is the other way around. HMGCP successfully transfers knowledge existing in regression and classification tasks to help recover the intensity functions in masked regions for Cox process tasks (Fig. 4), while MLGCP and MCPM are prone to overfitting because they can only transfer homogeneous knowledge (Fig. 5). Therefore, HMGCP defeats the competing baselines MLGCP and MCPM in terms of TLL in all experiments (Table 3). More importantly, HMGCP has a faster convergence, which needs 40-50 steps to converge in terms of training log-likelihood, than the first-order gradient-based MLGCP and MCPM requiring more than 400 and 1000 steps respectively (Fig. 6). Besides, HMGCP significantly outperforms MLGCP and MCPM in terms of efficiency (Table 3, only on two Cox process tasks for a fair comparison).

Table 3 The performance of TLL and RT for HMGCP, MLGCP and MCPM on the real data over ten random configurations of masked regions with three different sizes of mask. The mean and standard deviation (in brackets) are provided. Time in seconds
Fig. 6
figure 6

The training log-likelihood convergence of HMGCP, MLGCP and MCPM. HMGCP only takes 40–50 steps to converge, while MCPM and MLGCP require more than 400 and 1000 steps to converge respectively. MLGCP and MCPM achieve the higher training log-likelihood due to overfitting

6 Conclusion

The main objective of this study is to provide a heterogeneous multi-task learning framework for the analysis of multivariate inhomogeneous Poisson processes data with correlated regression and classification tasks. We adopt the MOGP prior to provide a shared representation to allow the transfer of knowledge between heterogeneous tasks. To circumvent the non-conjugate Bayesian inference, we employ the data augmentation technique to derive a closed-form mean-field approximation. Experimental results on synthetic and real data demonstrate that our model successfully shares the heterogeneous information to enhance the generalization capability and our inference approach has the predominant efficiency and convergence.

We adopted the LMC based MOGP to incorporate the correlation between multiple heterogeneous tasks. An interesting research track in the future may be the extension to MOGP based on process convolution, which may bring more benefits on computation efficiency. Moreover, we only consider three kinds of heterogeneous tasks: regression, classification and Cox process in this work; other kinds of unsupervised tasks, such as clustering, can also be attempted to be introduced to the multi-task framework.