1 Introduction

During recent decades, ever more textual data has become available, creating a growing need for statistical analysis of large document collections. The popular Latent Dirichlet Allocation (LDA) model introduced by Blei et al. (2003) is a generative probabilistic model in which each document is summarized by a set of latent semantic themes, often called topics. Formally, a topic is a probability distribution over the vocabulary. An estimated LDA model is therefore a compressed latent representation of the documents, where each document is a mixture of topics and each word (token) in a document belongs to a single topic. Most probabilistic topic models, such as LDA, are unsupervised, i.e., the topics are learned solely from the words in the documents, without access to other document meta-data.

In many situations, though, there is other information we would like to incorporate in modeling a corpus of documents. A common example is when we have labeled documents, such as ratings of movies together with a movie description, illness categories in medical journals, or the locations of identified bugs together with bug reports in software engineering applications. The simplest approach would be to use a standard topic model and then use the estimated topic distributions per document in another model, such as a logistic regression model. This two-step approach would result in topics that are not produced for the purpose of explaining the dependent variable of interest. Alternatively, one could use a supervised topic model to find the semantic topic structure in the documents that are related to the class of interest. The difference between a supervised topic model and a two-step approach is similar to the difference between principal component regression (PCR) and partial least squares (PLS). In PCR, the principal components are first computed and then a regression model is estimated based on the estimated components. In PLS the components are estimated together with the regression model with the purpose of estimating components that have good predictive performance for the dependent variable of interest (Geladi and Kowalski 1986).
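To make the two-step baseline concrete, here is a minimal sketch using scikit-learn, with placeholder documents and labels of our own (not data from the paper): an unsupervised topic model is estimated first, and a separate classifier is then trained on the per-document topic proportions, so the topics are learned without any knowledge of the labels.

```python
# Two-step baseline sketch: topics first, classifier second (placeholder data).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["a gripping crime drama", "a heartfelt small-town romance",
        "space marines fight an alien empire", "detectives chase a serial killer"]
labels = ["Crime", "Romance", "Sci-Fi", "Crime"]

counts = CountVectorizer().fit_transform(docs)     # document-term count matrix
theta = LatentDirichletAllocation(n_components=2,  # per-document topic proportions
                                  random_state=0).fit_transform(counts)
clf = LogisticRegression().fit(theta, labels)      # topics were not learned for this task
```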

Most of the proposed supervised topic models have been designed with the objective of finding good text classification models, and the focus has naturally been on predictive performance. However, the predictive performance of most supervised topic models is similar to that of a Support Vector Machine (SVM) with covariates based on word frequencies (Jameel et al. 2015). While predictive performance is certainly important, the real attraction of supervised topic models comes from their ability to learn semantically relevant topics and to use those topics to produce accurate, interpretable predictions for documents or other textual data. The interpretability of a model is an often-neglected feature, but it is crucial in real-world applications. As an example, Parnin and Orso (2011) show that bug fault localization systems are quickly disregarded when users cannot understand how the system reached its predictive conclusion. Compared to other text classification systems, topic models are very well suited for interpretable predictions since topics are abstract entities that humans can easily grasp. The problems of interpretability in multi-class supervised topic models can be divided into three main areas.

First, most supervised topic models use a logit or probit approach, where the model is identified by the use of a reference category to which the effect of any covariate is compared. This defeats one of the main purposes of supervised topic models since it complicates the interpretability of the models. Instead of interpreting the effect of a topic on a class, we need to interpret it as the effect on a class compared to the reference category.

Second, to handle multi-class categorization a topic should be able to affect multiple classes, and some topics may not influence any class at all. In most supervised topic modeling approaches (such as Jiang et al. 2012; Zhu et al. 2013; Jameel et al. 2015) the multi-class problem is solved using binary classifiers in a “one-vs-all” classification approach. This approach works well in the situation of evenly distributed classes, but may not work well for skewed class distributions (Rubin et al. 2012). A one-vs-all approach also makes it more difficult to interpret the model. Estimating one model per class makes it impossible to see which classes are affected by the same topic and which topics do not predict any label. In these situations, we would like to have one topic model to interpret. The approach of one-vs-all is also costly from an estimation point of view since we need to estimate one model per class (Zheng et al. 2015), something that can be difficult in a situation with hundreds of classes.

Third, there can be situations with hundreds of classes and hundreds of topics [see Jonsson et al. (2016) for an example]. Without regularization or variable selection, we would end up with a model with too many parameters to interpret and uncertain parameter estimates. In a good predictive supervised topic model, one would like to find a small set of topics that are strong determinants of a single document class label. This is especially relevant when the numbers of observations in different classes are skewed, a common problem in real-world situations (Rubin et al. 2012): for the rarer classes, we would like to induce more shrinkage than for the more common ones.

Multi-class regression is a non-trivial problem in Bayesian modeling. Historically, the multinomial probit model has been preferred due to the data augmentation approach proposed by Albert and Chib (1993): augmenting the sampler with latent variables leads to straightforward Gibbs sampling with conditionally conjugate updates of the regression coefficients. However, the Albert-Chib sampler often tends to mix slowly, and the same holds for improved samplers such as the parameter-expansion approach of Imai and van Dyk (2005). More recently, Polson et al. (2013) proposed a similar data augmentation approach using Polya-gamma variables for the Bayesian logistic regression model. This approach preserves conditional conjugacy under a normal prior for the regression coefficients and was the foundation for the supervised topic model in Zhu et al. (2013).

In addition to the issue of interpretability of models, scalability of topic models is of crucial importance. MCMC algorithms are generally considered to be computationally costly. In the case of probabilistic topic models, this is even more prominent, since we often sample at least one parameter per word for the whole corpus. Modern corpora can be very large, with millions of documents, making efficient and parallel sampling a crucial component.

In this paper we explore a new approach to supervised topic models that produces accurate multi-class predictions from semantically interpretable topics using a fully Bayesian approach, hence solving all three of the above-mentioned problems. The model combines LDA with the recently proposed Diagonal Orthant (DO) probit model (Johndrow et al. 2013) for multi-class classification, together with an efficient Horseshoe prior that achieves sparsity and interpretability through aggressive shrinkage (Carvalho et al. 2010). The new Diagonal Orthant Latent Dirichlet Allocation (DOLDA) model is demonstrated to have competitive predictive performance while still producing interpretable multi-class predictions from semantically relevant topics. In addition, we derive an efficient and parallel MCMC sampler that scales model inference to larger corpora.

The paper is organized as follows. Section 2 reviews related work, Sect. 3 describes the new proposed model, and Sect. 4 the proposed scalable MCMC sampler. In Sect. 5, the experimental results are presented, and in Sect. 6 we conclude the paper. A full derivation of the sampler is supplied in the “Appendix”.

2 Related work

Incorporating supervised information or meta-data in the estimation of topic models has been done in a large number of papers; early examples include Rosen-Zvi et al. (2004), Griffiths et al. (2005), and Chemudugunta et al. (2007), which incorporate author information, syntax, and background structures, respectively. How topic models incorporate the labeled information, such as classes, can broadly be classified into two groups: downstream supervised models and upstream supervised models. In upstream topic models, loosely defined, the topics are conditioned on the labeled information, while in downstream topic models, the labeled information is conditioned on the topics. Examples of upstream topic models are topics conditioned on authorship (Rosen-Zvi et al. 2004), topical perspectives (Ahmed and Xing 2010), and more general supervised information (Mimno and McCallum 2012).

In downstream topic models, of which our proposed model is an example, the label information is instead conditioned on the topics, similar to conditioning on covariates in a linear or logistic regression model. One of the first approaches was proposed by McAuliffe and Blei (2008), where the authors build a supervised topic model on the generalized linear model framework, making it possible to link binary, count, and continuous response variables to topics that are inferred jointly with the regression/classification effects. This idea has been elaborated further, especially for classification, in a series of papers closely connected to this work. The three most related approaches are Jiang et al. (2012), who propose a downstream supervised topic model using max-margin classification; Zhu et al. (2013), who propose a logistic supervised topic model using data augmentation with Polya-gamma variates; and Perotte et al. (2011), who use a hierarchical binary probit approach to model a hierarchical label structure in the form of a binary tree. All of these models are downstream supervised topic models using MCMC for inference and different forms of data augmentation to model classes; Zhu et al. (2013) and Perotte et al. (2011) combine a data augmentation strategy with a linear model, just like our proposed model. Compared to this earlier work, we differ in two major ways. First, we use the diagonal orthant data augmentation scheme, which is computationally attractive for many classes, essentially addressing scalability in the number of classes. Second, unlike Zhu et al. (2013), Perotte et al. (2011), and Jiang et al. (2012), we focus on interpretability by using the Horseshoe prior (Carvalho et al. 2010), something not done in work prior to ours. Like Zheng et al. (2015), we focus on scalability in supervised topic models, but unlike them we do not use Metropolis–Hastings to reduce computational complexity; instead, our approach relies on exchangeability to enable parallelism and on analytical updates to make computations more efficient.

3 Diagonal Orthant Latent Dirichlet Allocation

3.1 Handling the challenges of high-dimensional interpretable supervised topic models

To solve the first and second challenges identified in the Introduction, i.e., reference classes and multi-class models, we propose the use of the Diagonal Orthant (DO) probit model of Johndrow et al. (2013) as an alternative to the multinomial probit and logit models. Johndrow et al. (2013) propose a Gibbs sampler for the DO model and show that it mixes well. One of the benefits of the DO model is that, conditional on the latent variables, all classes can be modeled using independent binary probit models, thereby removing the need for a reference class. The parameters of the model can be interpreted as the effect of a covariate on the marginal probability of a specific class, which makes this model especially attractive when it comes to interpreting the inferred topics. The model also handles multiple classes in an efficient way that makes it possible to estimate the multi-class linear model in parallel over the classes, as sketched below.
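To illustrate the conditional independence the DO model provides, the sketch below (our own names and code, not the paper's implementation) draws the truncated-normal latent utilities for one document; given these utilities, each class reduces to a separate binary probit regression that can be updated independently.

```python
import numpy as np
from scipy.stats import truncnorm

def draw_do_utilities(linpred, y_d, rng):
    """DO latent utilities a_{d,l} for one document (a sketch).
    linpred: (L,) linear predictors; y_d: the observed label."""
    a = np.empty_like(linpred)
    for l, m in enumerate(linpred):
        if l == y_d:
            # Positive truncation: the observed class's utility lies above 0.
            a[l] = truncnorm.rvs(-m, np.inf, loc=m, scale=1.0, random_state=rng)
        else:
            # Negative truncation: all other utilities lie below 0.
            a[l] = truncnorm.rvs(-np.inf, -m, loc=m, scale=1.0, random_state=rng)
    return a

# Example: three classes, document labeled as class 1.
rng = np.random.default_rng(0)
print(draw_do_utilities(np.array([0.2, 1.1, -0.5]), 1, rng))
```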

The third challenge is that not all topics necessarily have an effect on the label of interest: one topic may affect one or more classes, while other topics may just be noise that we do not want in the supervision. When there are many topics and many classes, we also face a very large number of parameters to analyze. The Horseshoe prior of Carvalho et al. (2010) was specifically designed to filter out signals from massive noise. This prior uses a local-global shrinkage approach that shrinks some (or most) coefficients toward zero while allowing sparse signals to be estimated essentially without shrinkage. The approach has shown good performance in linear regression-type settings (Castillo et al. 2015), also with many predictors (Nalenz and Villani 2018), which makes it straightforward to incorporate other covariates into our model, something rarely done in the area of supervised topic models. We use different global shrinkage parameters for the different classes to handle unbalanced numbers of observations per class: this makes it possible to shrink more when there is little data for a given class and less for classes with more observations.

3.2 Generative model

The DOLDA generative model is described below. See also a graphical description of the model in Fig. 1. A summary of the notation is given in Table 1.

1. For each topic \(k=1,\ldots ,K\):

   (a) Draw a distribution over words \(\phi _{k}\sim \text {Dir}_{V}(\beta )\).

2. For each label \(l=1,\ldots ,L\):

   (a) Draw a global shrinkage parameter \(\tau _{l}\sim C^{+}(0,1)\).

   (b) For each covariate and topic \(p=1,\ldots ,K+P\):

      (i) Draw a local shrinkage parameter \(\lambda _{l,p}\sim C^{+}(0,1)\).

      (ii) Draw the coefficient \(\eta _{l,p}\sim {\mathcal {N}}(0,\tau _{l}^{2}\lambda _{l,p}^{2})\).

3. For each observation/document \(d=1,\ldots ,D\):

   (a) Draw topic proportions \(\theta _{d}\sim \text {Dir}_{K}(\alpha )\).

   (b) For each token \(n=1,\ldots ,N_{d}\):

      (i) Draw the topic assignment \(z_{n,d}\mid \theta _{d}\sim \text {Categorical}(\theta _{d})\).

      (ii) Draw the word \(w_{n,d}\mid z_{n,d},\phi _{z_{n,d}}\sim \text {Categorical}(\phi _{z_{n,d}})\).

   (c) Draw the label \(y_{d}\sim \text {Categorical}({\mathbf {p}}_{d})\), where

      $$\begin{aligned} {\mathbf {p}}_{d}=\left[ \sum _{l=1}^{L}F_{{\mathcal {N}}}\left( (\bar{{\mathbf {z}}},{\mathbf {x}})_{d}^{\top }\eta _{l\cdot }\right) \right] ^{-1}\left( F_{{\mathcal {N}}}\left( (\bar{{\mathbf {z}}},{\mathbf {x}})_{d}^{\top }\eta _{1\cdot }\right) ,\ldots ,F_{{\mathcal {N}}}\left( (\bar{{\mathbf {z}}},{\mathbf {x}})_{d}^{\top }\eta _{L\cdot }\right) \right) \end{aligned}$$

      and \(F_{{\mathcal {N}}}(\cdot )\) is the univariate standard normal CDF (Johndrow et al. 2013).

Table 1 DOLDA model notation
Fig. 1 The Diagonal Orthant probit supervised topic model (DOLDA)
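As a concrete illustration, the following sketch simulates from the DOLDA generative process. All dimensions and the Poisson document length are our own illustrative choices, and we set \(P=0\) so that only topics enter the regression.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
K, V, L, D = 5, 50, 3, 20                 # topics, vocabulary, labels, documents
alpha, beta = 0.01, 0.01                  # symmetric Dirichlet hyperparameters

phi = rng.dirichlet(np.full(V, beta), size=K)          # 1. topic-word distributions
tau = np.abs(rng.standard_cauchy(L))                   # 2a. global shrinkage, C+(0,1)
lam = np.abs(rng.standard_cauchy((L, K)))              # 2b(i). local shrinkage, C+(0,1)
eta = rng.normal(0.0, tau[:, None] * lam)              # 2b(ii). coefficients, L x K

docs, labels = [], []
for d in range(D):
    N_d = 1 + rng.poisson(80)                          # document length (our choice)
    theta = rng.dirichlet(np.full(K, alpha))           # 3a. topic proportions
    z = rng.choice(K, size=N_d, p=theta)               # 3b(i). topic assignments
    w = np.array([rng.choice(V, p=phi[k]) for k in z]) # 3b(ii). words
    zbar = np.bincount(z, minlength=K) / N_d           # mean topic indicators
    p_d = norm.cdf(eta @ zbar)                         # 3c. normal CDFs per label
    docs.append(w)
    labels.append(rng.choice(L, p=p_d / p_d.sum()))    # normalized as in the model
```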

4 Inference

4.1 The MCMC algorithm

Markov Chain Monte Carlo (MCMC) is used to estimate the model parameters. We use a separate global shrinkage parameter \(\tau _{l}\) for each class, since different classes can have very different numbers of observations. This gives the following sampler for inference in DOLDA; see “Appendix A” for details.

1. Sample the latent variables \(a_{d,l}\sim {\mathcal {N}}_{+}\left( ({\mathbf {x}}\,\bar{{\mathbf {z}}})_{d}^{\top }\eta _{l},1\right) \) for \(l=y_{d}\) and \(a_{d,l}\sim {\mathcal {N}}_{-}\left( ({\mathbf {x}}\,\bar{{\mathbf {z}}})_{d}^{\top }\eta _{l},1\right) \) for \(l\ne y_{d}\), where \({\mathcal {N}}_{+}\) and \({\mathcal {N}}_{-}\) denote the normal distribution truncated at 0 from below and from above, respectively.

2. Sample the regression coefficients as in an ordinary Bayesian linear regression, per class label l:

   $$\begin{aligned} \eta _{l}\sim {\mathcal {MVN}}\left( \mu _{l},\left( ({\mathbf {X}}\,\bar{{\mathbf {Z}}})^{\top }({\mathbf {X}}\,\bar{{\mathbf {Z}}})+\tau _{l}^{2}\Lambda _{l}\right) ^{-1}\right) ,\quad \mu _{l}=\left( ({\mathbf {X}}\,\bar{{\mathbf {Z}}})^{\top }({\mathbf {X}}\,\bar{{\mathbf {Z}}})+\tau _{l}^{2}\Lambda _{l}\right) ^{-1}({\mathbf {X}}\,\bar{{\mathbf {Z}}})^{\top }{\mathbf {a}}_{l}, \end{aligned}$$

   where \(\Lambda _{l}\) is a diagonal matrix with the local shrinkage parameters \(\lambda _{l,p}\), one per element of \(\eta _{l}\).

3. Sample the global shrinkage parameter \(\tau _{l}\) at iteration j using the following two-step slice sampler (a code sketch of this and the next step follows the list):

   $$\begin{aligned} u&\sim {\mathcal {U}}\left( 0,\left[ 1+\frac{1}{\tau _{l,(j-1)}^{2}}\right] ^{-1}\right) \\ \frac{1}{\tau _{l,j}^{2}}&\sim {\mathcal {G}}\left( \frac{K+P+1}{2},\ \frac{1}{2}\sum _{p=1}^{K+P}\left( \frac{\eta _{l,p}}{\lambda _{l,p}}\right) ^{2}\right) I\left[ \frac{1}{\tau _{l,j}^{2}}<(1-u)/u\right] \end{aligned}$$

   where I indicates the truncation region of the truncated gamma distribution.

4. Sample each local shrinkage parameter \(\lambda _{l,p}\) at iteration j as

   $$\begin{aligned} u&\sim {\mathcal {U}}\left( 0,\left[ 1+\frac{1}{\lambda _{l,p,(j-1)}^{2}}\right] ^{-1}\right) \\ \frac{1}{\lambda _{l,p,j}^{2}}&\sim \text {Exp}\left( \frac{1}{2}\left( \frac{\eta _{l,p}}{\tau _{l}}\right) ^{2}\right) I\left[ \frac{1}{\lambda _{l,p,j}^{2}}<(1-u)/u\right] \end{aligned}$$

   where I indicates the truncation region of the truncated exponential distribution.

5. Sample the topic indicators \({\mathbf {z}}\) from

   $$\begin{aligned} p(z_{i,d}=k\mid w_{i},{\mathbf {z}}^{\lnot i},\eta ,{\mathbf {a}})&\propto \phi _{v,k}\cdot \left( {\mathbf {M}}_{d,k}^{\lnot i}+\alpha \right) \\ &\quad \times \exp \left( -\frac{1}{2}\sum _{l=1}^{L}\left[ -2\frac{\eta _{l,k}}{N_{d}}\left( a_{d,l}-(\bar{{\mathbf {z}}}_{d}^{\lnot i}\,{\mathbf {x}}_{d})\eta _{l}^{\intercal }\right) +\left( \frac{\eta _{l,k}}{N_{d}}\right) ^{2}\right] \right) \end{aligned}$$

   where \({\mathbf {M}}\) is a \(D\times K\) count matrix containing the sufficient statistics for \(\Theta \) and \({\mathbf {M}}^{\lnot i}\) is this matrix with topic indicator \(z_{i,d}\) removed.

6. Sample the topic-vocabulary distributions \(\Phi \):

   $$\begin{aligned} \phi _{k}\sim \text {Dir}(\beta +{\mathbf {N}}_{k,\cdot }) \end{aligned}$$

   where \({\mathbf {N}}\) is a \(K\times V\) count matrix containing the sufficient statistics for \(\Phi \).
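The following sketch implements the slice-sampling updates in steps 3 and 4 for a single label l, drawing the truncated gamma and truncated exponential by inverse-CDF sampling (function and variable names are our own).

```python
import numpy as np
from scipy.stats import gamma

def sample_shrinkage(eta_l, tau2, lam2, rng):
    """One sweep of steps 3-4 for label l (a sketch).
    eta_l: (K+P,) coefficients; tau2: tau_l^2; lam2: (K+P,) lambda_{l,p}^2."""
    KP = eta_l.size

    # Step 3: slice variable u, then truncated Gamma for 1/tau_l^2.
    u = rng.uniform(0.0, 1.0 / (1.0 + 1.0 / tau2))
    g = gamma(a=(KP + 1) / 2.0, scale=2.0 / np.sum(eta_l**2 / lam2))
    upper = (1.0 - u) / u                       # truncation: 1/tau^2 < (1-u)/u
    tau2 = 1.0 / g.ppf(rng.uniform(0.0, g.cdf(upper)))  # inverse-CDF draw

    # Step 4: slice variables, then truncated Exponential for each 1/lambda^2.
    u = rng.uniform(0.0, 1.0 / (1.0 + 1.0 / lam2))
    rate = 0.5 * eta_l**2 / tau2
    upper = (1.0 - u) / u
    cdf_upper = 1.0 - np.exp(-rate * upper)     # Exp(rate) CDF at truncation point
    lam2 = 1.0 / (-np.log1p(-rng.uniform(size=KP) * cdf_upper) / rate)
    return tau2, lam2
```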

4.2 Efficient parallel sampling of \({\mathbf {z}}\)

To improve the speed of the sampler, we cache the calculations in the supervised part of the topic indicator sampler and parallelize the sampler. Very large text corpora are increasingly common, so efficient sampling of \({\mathbf {z}}\) is crucial in practice. The basic sampler for \({\mathbf {z}}\) can be slow due to the serial nature of the collapsed sampler and the fact that the supervised part of \(p(z_{i,d})\) involves a dot product; a naive implementation has complexity \(O((K+P)\cdot L\cdot K)\) for sampling a single topic indicator \(z_{i}\).

The supervised part of document d can be expressed as \(g_{d,k}^{\lnot i}\) where

$$\begin{aligned} g_{d,k}^{\lnot i}=\exp \left( -\frac{1}{2}\sum _{l=1}^{L}\left[ -2\frac{\eta _{l,k}}{N_{d}}\left( a_{d,l}-(\bar{{\mathbf {z}}}_{d}^{\lnot i}\,{\mathbf {x}}_{d})\eta _{l}^{\intercal }\right) +\left( \frac{\eta _{l,k}}{N_{d}}\right) ^{2}\right] \right) . \end{aligned}$$

Sampling a topic indicator \(z_{i,d}\) only changes this quantity slightly, which yields the relationship

$$\begin{aligned} g_{d,k}^{\lnot i}=g_{d,k}^{\lnot (i-1)}+\frac{1}{N_{d}^{2}}\left[ \sum _{l=1}^{L}\eta _{l,k}\eta _{l,z_{i,d}}-\sum _{l=1}^{L}\eta _{l,k}\eta _{l,z_{i-1,d}}\right] , \end{aligned}$$

where \(g_{d,k}^{\lnot (i-1)}\) is the supervised effect computed for the previous token. The expression \(\sum _{l=1}^{L}\eta _{l,k}\eta _{l,z_{i,d}}\) can be calculated once per draw of \(\eta \) and stored in a \(K\times K\) array. We can then use the relationship above to update the supervision after sampling each topic indicator, calculating \(g_{d,k}^{\lnot i}\) “on the fly” from the previous supervised contribution. This means that \(g_{d,k}^{\lnot i}\) needs to be computed in full only once per document, after which it is merely updated. This approach increases the speed by an order of magnitude for a model with 100 topics and reduces the computational complexity of sampling one \(z_{i,d}\) from \(O((K+P)\cdot L\cdot K)\) to \(O(K)\) for all but the first token per document; a code sketch follows below. For details, see “Appendix B”.
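In code, the trick amounts to caching the \(K\times K\) matrix of label-summed coefficient products and applying the additive update above. A sketch with our own variable names (the full computation of the first token's values is omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
L_, K, N_d = 3, 5, 80                  # labels, topics, document length (illustrative)
eta = rng.normal(size=(L_, K))         # topic-part regression coefficients, L x K

# Recomputed once per draw of eta: E[k1, k2] = sum_l eta[l, k1] * eta[l, k2].
E = eta.T @ eta                        # K x K cache

# g[k] holds the supervised contribution for candidate topic k of the current
# token; computed in full only for the first token of the document (omitted).
g = rng.normal(size=K)                 # placeholder initial values

# Moving from token i-1 to token i, with z_curr the topic of token i and
# z_prev the topic of token i-1, the relation above gives an O(K) update:
z_curr, z_prev = 2, 4
g = g + (E[:, z_curr] - E[:, z_prev]) / N_d**2
```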

To further improve performance, we parallelize the sampler using the fact that documents are conditionally independent given \(\Phi \). By sampling \(\Phi \) instead of marginalizing it out, we gain from parallelization at the additional cost of sampling \(\Phi \). This approach to parallelizing topic models gives a sampler that correctly targets the posterior using an ergodic Markov chain, unlike other parallel approaches such as AD-LDA (Magnusson et al. 2018; Newman et al. 2009).

In summary, we propose a sampler that draws the \(z_{i,d}\) in parallel over documents, the rows of \(\Phi \) in parallel over topics, and \(\eta \) in parallel over classes. The code is publicly available at https://github.com/lejon/DiagonalOrthantLDA.

4.3 Computational complexity

The computational complexity of the sampler depends on the different parameters sampled; below we analyze the parts of the sampler for one iteration. Sampling the regression coefficients has three parts: (1) computing \(\Lambda _{post}=(({\mathbf {X}}\,\bar{{\mathbf {Z}}})^{\top }({\mathbf {X}}\,\bar{{\mathbf {Z}}})+\tau _{l}^{2}\Lambda _{l})\) is of complexity \(O(L\cdot D\cdot (K+P)^{2})\), (2) inverting \(\Lambda _{post}\) for all classes is of complexity \(O(L\cdot (K+P)^{3})\), and (3) sampling from the multivariate Gaussian distribution of each \(\eta _{l}\) is also of complexity \(O(L\cdot (K+P)^{3})\). Sampling the first topic indicator in each document has complexity \(O((K+P)\cdot L\cdot K\cdot D)\), which is dominated by \(O(L\cdot D\cdot (K+P)^{2})\). With the efficiency improvements proposed above, all other topic indicators can be sampled with complexity \(O(K\cdot N)\). Sampling the latent variables \({\mathbf {a}}\) is of complexity \(O((K+P)\cdot D\cdot L)\) and is hence dominated by the sampling of \(\eta \). Similarly, sampling \(\tau \) and \(\lambda \) is of complexity \(O((K+P)\cdot L)\) and is also dominated by the sampling of \(\eta \). Sampling \(\Phi \) is of complexity \(O(K\cdot V)\), which is generally dominated by the \(O(K\cdot N)\) cost of sampling the topic indicators, since typically \(N\gg V\) (Magnusson et al. 2018).

Finally, the total complexity of the sampler, with regard to the number of classes (L), the number of topics (K), the number of documents (D), the mean document length (\({\bar{N}}=N/D\)), and the number of covariates (P), is \(O(L\cdot (K+P)^{3}+L\cdot D\cdot (K+P)^{2}+K\cdot D\cdot {\bar{N}})\). From this analysis we can see that as the corpus grows (\(D\rightarrow \infty \)), sampling \(\eta \) and the topic indicators \({\mathbf {z}}\) will dominate the computations. But we would also expect the number of topics to grow with the number of documents; in that situation, the main cost of the algorithm is sampling the \(\eta \)s and the first topic indicator of each document.

Due to the similarity of the DOLDA sampler to that of the MedLDA sampler, it is straightforward to use the cyclical Metropolis–Hastings proposals in Zheng et al. (2015) for inference in DOLDA. But, as shown in Magnusson et al. (2018), it is not obvious that the reduction in sampling complexity will result in a faster sampler when MCMC efficiency is taken into account.

4.4 Evaluation of convergence and prediction

We evaluate the convergence of the MCMC algorithm by monitoring the unnormalized log-likelihood over the iterations:

$$\begin{aligned} \log {\mathcal {L}}({\mathbf {w}},{\mathbf {y}}\mid {\mathbf {z}},\eta ,{\mathbf {X}},\alpha ,\beta )&= \log p({\mathbf {y}}\mid {\mathbf {z}},\eta ,{\mathbf {X}}) + \log p({\mathbf {w}}\mid {\mathbf {z}},\alpha ,\beta ) \\ &\propto \sum _{d=1}^{D}\log \left[ \left( 1-F_{{\mathcal {N}}}\left( -(\bar{{\mathbf {z}}}_{d}\,{\mathbf {x}}_{d})\eta _{y_{d}}^{\intercal }\right) \right) \prod _{l\ne y_{d}}F_{{\mathcal {N}}}\left( -(\bar{{\mathbf {z}}}_{d}\,{\mathbf {x}}_{d})\eta _{l}^{\intercal }\right) \right] \\ &\quad - \sum _{d=1}^{D}\log \left[ \sum _{j=1}^{L}\left( 1-F_{{\mathcal {N}}}\left( -(\bar{{\mathbf {z}}}_{d}\,{\mathbf {x}}_{d})\eta _{j}^{\intercal }\right) \right) \prod _{l\ne j}F_{{\mathcal {N}}}\left( -(\bar{{\mathbf {z}}}_{d}\,{\mathbf {x}}_{d})\eta _{l}^{\intercal }\right) \right] \\ &\quad + K\log \Gamma \left( V\beta \right) - KV\log \Gamma \left( \beta \right) + \sum _{k=1}^{K}\sum _{v=1}^{V}\log \Gamma \left( {\mathbf {N}}_{k,v}+\beta \right) - \sum _{k=1}^{K}\log \Gamma \left( \sum _{v=1}^{V}{\mathbf {N}}_{k,v}+V\beta \right) \\ &\quad + D\log \Gamma \left( K\alpha \right) - DK\log \Gamma \left( \alpha \right) + \sum _{d=1}^{D}\sum _{k=1}^{K}\log \Gamma \left( {\mathbf {M}}_{d,k}+\alpha \right) - \sum _{d=1}^{D}\log \Gamma \left( \sum _{k=1}^{K}{\mathbf {M}}_{d,k}+K\alpha \right) , \end{aligned}$$

where \(F_{{\mathcal {N}}}\) is the univariate standard normal CDF and the last two lines are the Dirichlet-multinomial terms commonly used when evaluating the standard LDA model.
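For monitoring, the label part of this quantity can be computed in a vectorized way. The sketch below (our own names) works on the log scale, using \(\log (1-F_{{\mathcal {N}}}(-s))=\log F_{{\mathcal {N}}}(s)\) for numerical stability.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def label_loglik(design, eta, y):
    """Label part of the monitored log-likelihood (a sketch).
    design: D x (K+P) matrix with rows (zbar_d, x_d); eta: L x (K+P); y: (D,) labels."""
    S = design @ eta.T                           # D x L linear predictors
    logF = norm.logcdf(-S)                       # log F_N(-s_l)
    logW = norm.logcdf(S)                        # log(1 - F_N(-s_l))
    rows = np.arange(len(y))
    row_sum = logF.sum(axis=1)
    # Numerator: the observed label "wins" and every other class "loses".
    num = logW[rows, y] + row_sum - logF[rows, y]
    # Denominator: the same expression summed over all candidate winners.
    den = logsumexp(logW + row_sum[:, None] - logF, axis=1)
    return float(np.sum(num - den))
```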

To make predictions for a new document \(d^\star \) we first need to sample the topic indicators of the given document from

$$\begin{aligned} p(z_{i,d^\star }=k|{\mathbf {w}}^{\star },\Phi )\propto {\bar{\phi }}_{k,v}\cdot \left( {\mathbf {M}}^{\lnot i}_{d,k}+\alpha \right) , \end{aligned}$$

where \({\bar{\phi }}_{k,v}\) is the posterior mean of \(\Phi \) computed from the last part of the posterior draws. We use this posterior mean instead of integrating out \(\Phi \) to avoid potential problems with label switching; however, we have not seen any indications of label switching after convergence in our experiments, probably because the data sets used for document predictions are usually quite large. The topic indicators for the new document are sampled using the fast PC-LDA sampler in Magnusson et al. (2018). The mean of the sampled topic indicator vector, \(\bar{{\mathbf {z}}}^{\star }\), is then used for class prediction:

$$\begin{aligned} y^{\star }=\arg \max \left( (\bar{{\mathbf {z}}}^{\star },{\mathbf {x}}^{\star })^{\top }\eta \right) . \end{aligned}$$

This is a maximum a posteriori estimate, but it is straightforward to calculate the whole predictive distribution for the label.
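A sketch of the prediction step (our own names), returning both the point prediction and the normalized DO probit predictive distribution:

```python
import numpy as np
from scipy.stats import norm

def predict_label(zbar_star, x_star, eta):
    """Point prediction and predictive distribution for a new document (sketch).
    zbar_star: (K,) mean topic indicators; x_star: (P,) covariates; eta: L x (K+P)."""
    scores = eta @ np.concatenate([zbar_star, x_star])  # linear predictor per class
    p = norm.cdf(scores)                                # unnormalized class probabilities
    return int(np.argmax(scores)), p / p.sum()
```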

5 Experiments

We study model performance in four different ways: classification accuracy, interpretability of the model, topic quality and supervision effects on topics, and scalability of the sampler. The experiments were performed on two sockets of 8-core Intel Xeon E5-2660 Sandy Bridge processors at 2.2 GHz at the National Supercomputer Center (NSC) at Linköping University.

5.1 Corpora and priors

To study the different aspects of the DOLDA model we use multiple corpora. We collected a corpus containing the 10,810 highest-rated movies at IMDb.com, using both the textual description and information about producers and directors to classify a given movie into a genre. We also analyze the classical 20 Newsgroups corpus.

In addition, we also include two corpora based on the New York Times Annotated Corpus (Sandhaus 2008) for our experiments. To label the documents we use the classification in the “online section”. Thus, we only use articles from 2001 and later, when the “online section” was added. From these documents, we extract labels that we call “Top level” and “Hierarchical”. These labels are used as the class of the documents. An example of an “online section” is

Arts; Dining and Wine; Education; Books

A document with the above example online-section would get the top label “Arts” and the hierarchical (2 level) label “Arts; Dining and Wine”. For the hierarchical classification, we extracted only the articles which had at least two levels (“Arts” being the first level and “Dining and Wine” the second level in the example above). After extracting the classes for the documents we create four subsets from the corpora described above. These subsets contain 90%, 60%, 20%, and 10% of documents sampled from the above corpus. None of the documents in the 10% subset exists in the other subsets. The purpose of the NYT corpora is to show how the sampler scales, both with regard to documents (\(\sim \) 600 K) and with a large number of classes (240).
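The construction of the two label sets from an “online section” string can be sketched as follows (our own code, not the paper's preprocessing scripts):

```python
def nyt_labels(online_section):
    """Top-level and hierarchical (2-level) labels from an 'online section'."""
    parts = [p.strip() for p in online_section.split(";")]
    top = parts[0]
    hier = "; ".join(parts[:2]) if len(parts) >= 2 else None
    return top, hier

# ("Arts", "Arts; Dining and Wine")
print(nyt_labels("Arts; Dining and Wine; Education; Books"))
```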

Our companion paper (Jonsson et al. 2016) applies the DOLDA model developed here to bug localization in a large-scale software engineering context using a corpus with 15,000 bug reports, each belonging to one of 118 classes.

We include the corpora for different purposes. IMDb is a smaller corpus, but contains additional covariates. The 20 Newsgroups corpus is included to enable accuracy comparisons with other supervised topic models. The New York Times corpus is included to show the scalability of the MCMC sampler with regard to the number of classes as well as the number of documents.

The corpora are tokenized and a standard stop list of English words is removed, as well as the rarest word types, which together make up 1% of the total tokens, or, in some experiments, word types that occur fewer than 10 times. In the IMDb corpus, we only include genres with at least 10 movies (Table 2).

Table 2 Corpora used in experiment, by the number of classes (L), the number of documents (D), the vocabulary size (V), and the total number of tokens (N)

In all experiments, we use a relatively vague prior setting \(\alpha =\beta =0.01\) for the LDA part of the model, and \(c=100\) for the prior variance of the \(\eta \) coefficients under the normal prior, as well as for the intercept coefficient when using the Horseshoe prior. The accuracy experiment for IMDb uses 5-fold cross-validation, and the 20 Newsgroups corpus uses the same training and test sets as Zhu et al. (2012) to enable direct comparisons of accuracy. In the interpretability analysis of the IMDb corpus we use the whole corpus, without cross-validation.

5.2 Results

5.2.1 Classification accuracy

Figure 2 shows the accuracy on the hold-out test set for the 20 Newsgroups corpus for different numbers of topics. The accuracy of our model is slightly lower than that of MedLDA and of an SVM using only textual features, but higher than both the classical supervised multi-class LDA and ordinary LDA combined with an SVM.

We can also see from Fig. 2 that DOLDA with the topics estimated jointly with the supervision part outperforms a two-step approach of first estimating LDA and then fitting the DO probit model with the pre-estimated mean topic indicators as covariates. This holds for both the Horseshoe prior and the normal prior, although the difference in accuracy is only a few percentage points.

Fig. 2 Accuracy of MedLDA, taken from Zhu et al. (2012) (left), and accuracy of DOLDA on the 20 Newsgroups test set (right)

The advantage of DOLDA is that it produces interpretable predictions with semantically relevant topics. Therefore, it is reassuring that DOLDA can compete in accuracy with other less interpretable models such as the SVM, even when the model is dramatically simplified by aggressive Horseshoe shrinkage for interpretational purposes. Our next analysis illustrates the interpretational strength of DOLDA. See also our companion paper (Jonsson et al. 2016) in the software engineering literature for further demonstrations of DOLDA’s ability to produce interpretable predictions in industrial applications.

Figure 3 displays the accuracy on the IMDb corpus as a function of the number of topics. The estimated DOLDA model also contains several other discrete covariates, such as the film’s director and producer. The accuracy of the more aggressive Horseshoe prior is better than the normal prior for all topic sizes. A supervised approach with topics and supervision inferred jointly again outperforms a two-step approach.

Fig. 3 Accuracy for DOLDA on the IMDb data with the normal and Horseshoe priors, and using a two-step approach with the Horseshoe prior

5.2.2 Model interpretability

To illustrate the interpretability of DOLDA, we fit a new model to the IMDb corpus using only topics as covariates. Note first in Fig. 4 how the Horseshoe prior distinguishes between so-called signal topics and noise topics: it aggressively shrinks a large fraction of the regression coefficients toward zero, making it much easier to interpret how different latent aspects of the documents affect the class label. This is achieved without setting any additional hyper-parameters in the model.

Fig. 4 Coefficients for the IMDb corpus with 80 topics using the normal prior (left) and the Horseshoe prior (right)

The Horseshoe shrinkage makes it easy to identify the topics that affect a given class. This is illustrated for the Romance genre in the IMDb corpus in Fig. 5. This genre consists of relatively few observations (only 39 movies), and the Horseshoe prior therefore shrinks most coefficients to zero, keeping only one large signal topic, which happens to have a negative effect on the Romance genre. The normal prior, on the other hand, gives a much denser and therefore much less interpretable solution.

Fig. 5 Coefficients for the genre Romance in the IMDb corpus with 80 topics using the Horseshoe prior (upper) and the normal prior (lower)

For a further analysis of what triggers a Romance genre label, Table 3 shows the 10 top words for Topic 39. From this table, it is clear that the signal topic identified using the Horseshoe prior is some sort of “crime” topic that is negatively related to the Romance genre, which makes intuitive sense. The crime topic is clearly expected to be positively related to the Crime genre, and Fig. 6 shows that this is indeed the case.

Table 3 Top words in topics using the Horseshoe prior
Fig. 6 Regression coefficients for the class Crime in the IMDb corpus with 80 topics using the Horseshoe prior (upper) and the normal prior (lower)

We can also see from Fig. 6 that Topic 33 has a strong negative effect on the Crime genre. In Table 3 we can see that Topic 33 seems to be some sort of Sci-Fi topic. This topic has, in turn, the largest positive relationship with the Sci-Fi movie genre.

This illustrates how the aggressive shrinkage of the Horseshoe prior not only increases the prediction accuracy but also simplifies interpretation: a much smaller number of topics is estimated to affect a given label, making it easier to focus on the topics that actually have an effect. This is much more difficult under the normal prior.

5.2.3 Topic quality and effect of supervision

Although the accuracy improves with a supervised approach, this raises the question of how the supervision affects the quality of the individual topics. How are the topics affected by the supervision and by shrinkage priors?

To study the effect on the topics, we focus on two measures of topic quality. First, we study topic coherence using the measure proposed by Mimno et al. (2011). This measure has been shown in experiments to be a good approximation of human judgments of topical coherence obtained through manual annotation tasks such as topic intrusion (Chang et al. 2009).

Second, we study the document entropy of the topics. This measure indicates how the topics are distributed over documents: are topics spread evenly over documents (high entropy) or concentrated in few documents (low entropy)? This gives an indication of the effect of the supervision on the topics: does the supervised information make the distribution over documents more or less sparse?
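Sketches of the two measures as we compute them (our own code; the coherence follows the co-document frequency formulation of Mimno et al. 2011):

```python
import numpy as np

def topic_coherence(top_words, doc_word):
    """Coherence of Mimno et al. (2011) for one topic (a sketch).
    top_words: indices of the topic's M most probable words, in order;
    doc_word: binary D x V document-word incidence matrix."""
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_docs = np.sum(doc_word[:, top_words[m]] * doc_word[:, top_words[l]])
            score += np.log((co_docs + 1.0) / np.sum(doc_word[:, top_words[l]]))
    return score

def document_entropy(M, k):
    """Entropy of topic k's distribution over documents, using the D x K
    topic-count matrix M (the sufficient statistics for Theta)."""
    p = M[:, k] / M[:, k].sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))
```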

In Fig. 7 we can see that, in general, there is no large difference in coherence or document entropy between the different models and priors; this also holds for the other corpora (not shown). This indicates that the effect of the supervision on the inferred topics is small: document entropy and topic coherence remain more or less the same with and without supervision.

Fig. 7 Document entropy (left) and topic coherence (right) for the IMDb corpus

To study the effect of the supervision in more detail, we focus instead on the topics that are actually affected by the supervision in the model. Since each topic is associated with L coefficients, we measure the supervised effect on topic k, denoted \(r_k\), by the sum of the absolute values of its \(\eta \) coefficients, i.e.

$$\begin{aligned} r_k = \sum ^L_{l=1} |\eta _{l,k}|. \end{aligned}$$

This is a rough estimate of the overall supervised effect on an individual topic. We also studied negative and positive regression coefficients separately, with similar results. Figure 8 presents the results for different corpora, priors, and topic quality measures. In all cases we use \(K=50\); the results for \(K=100\) are similar.

Fig. 8 Coherence and document entropy by supervised effect with 50 topics

Figure 8 shows the effect of the supervision. The effect of the Horseshoe prior is visible in that the supervised effects are generally smaller, due to the shrinkage imposed by the prior. With regard to topic coherence and document entropy, the supervision has no clear effect; rather, the effect of the supervision on the topics appears to be corpus-specific. In the 20 Newsgroups corpus we can see a small positive relationship between coherence and the supervised effect, something that is not shared by the other corpora.

These results indicate that the effect of the supervision on individual topics depends on the corpus (and the label). This makes sense in that different types of labels relate to different aspects of the text and the underlying topics. For 20 Newsgroups, we see a slight positive correlation between coherence and the supervised effects, and this is also a corpus with higher prediction accuracy. Overall, however, the results indicate that the supervision has no clear effect on topic quality as measured by document entropy and coherence.

5.2.4 Scaling and parallelism performance

One of the contributions of this model is its ability to scale to larger corpora using a parallel and efficient MCMC sampler. To study the scalability of the model, we ran experiments measuring the runtime effects of the aspects identified in the complexity analysis: the number of classes, the number of documents, and the number of topics.

Fig. 9 Scaling performance (left) and parallel performance (right). The scaling experiments were run for 5000 iterations and the parallel performance experiments for 1000 iterations each; all were run with 3 different random seeds and the average runtime computed. The parallel experiment used the 20% NYT Hierarchical data with 2, 4, 8, and 16 cores

Figure 9 shows the results of the scaling and parallel performance experiments. The scaling in the size of the corpus is more or less linear, as expected. More interestingly, the runtimes of the two NYT corpora are very similar: the hierarchical NYT corpus has nearly eight times as many classes (240 vs. 31) as the top-level NYT corpus while being roughly one third its size. Hence we can conclude that, empirically, the largest effect on runtime comes from the number of documents (or tokens) rather than the number of classes, for a standard setting with 100 topics.

Figure 9 also shows the parallel performance of the sampler. Unlike most other large-scale MCMC samplers for topic models, this sampler is both parallel and samples using an ergodic Markov chain with the posterior as its target, yet it still achieves good parallel performance, even on a smaller corpus such as IMDb. Concurrency over the different classes also matters: for the hierarchical NYT corpus, the larger number of classes increases the overall sampling time, but the parallel performance is not affected. We can also see that the parallelism is mainly needed when the number of topics is large.

6 Conclusions

Several supervised topic models have been proposed with the purpose of identifying topics that can be used successfully to classify documents. We have proposed DOLDA, a supervised topic model with special emphasis on generating semantically interpretable predictions, together with an efficient and scalable MCMC sampler for inference. An important component of the model for easing interpretation is the DO probit model, which needs no reference class. By coupling the DO probit model with an aggressive Horseshoe prior whose shrinkage is allowed to vary over the classes, we obtain an interpretable classification model that automatically identifies the most interesting “signal” topics. At the same time, the DOLDA model has very few hyperparameters: only the standard LDA parameters \(\alpha \) and \(\beta \) are needed, and their settings have been studied extensively by Wallach et al. (2009). This paucity of hyperparameters sets DOLDA apart from most other supervised topic models (Jiang et al. 2012; Zhu et al. 2012; Li et al. 2015).

Our experiments show that the gain in interpretability from using DOLDA comes with only a small reduction in prediction accuracy compared to state-of-the-art supervised topic models; moreover, DOLDA outperforms other fully Bayesian models such as the original supervised LDA model. We have also shown that learning the topics jointly with the classification part of the model gives more accurate predictions than a two-step approach, in which a topic model is first estimated and a classifier is then trained on the learned topics, demonstrating a general benefit of supervised topic modeling.

The Horseshoe prior has also shown its benefits in supervised topic models, giving a much clearer picture of the important topics for a given label with similar, or better, prediction accuracy. The computational cost of the Horseshoe prior is small compared to the other parts of the sampler, making it an attractive prior for other supervised models as well.

The supervision effect on the topics is generally small and in line with previous results. In general, the supervision does not seem to affect topic interpretability much, but there are indications that this is corpus (and label) dependent: in the 20 Newsgroups corpus, where the accuracy is higher, the relationship between topic coherence and the supervision effect is slightly positive.

Finally, we show that the DOLDA model scales well to large corpora and many classes. Still, further improvements in scalability with respect to the number of topics K can likely be achieved; the ideas of Zheng et al. (2015) could probably help here, but we leave this for future work.