INGOT-DR: an interpretable classifier for predicting drug resistance in M. tuberculosis
Algorithms for Molecular Biology volume 16, Article number: 17 (2021)
Abstract
Motivation
Prediction of drug resistance and identification of its mechanisms in bacteria such as Mycobacterium tuberculosis, the etiological agent of tuberculosis, is a challenging problem. Solving this problem requires a transparent, accurate, and flexible predictive model. The methods currently used for this purpose rarely satisfy all of these criteria. On the one hand, approaches based on testing strains against a catalogue of previously identified mutations often yield poor predictive performance; on the other hand, machine learning techniques typically have higher predictive accuracy, but often lack interpretability and may learn patterns that produce accurate predictions for the wrong reasons. Current interpretable methods may either exhibit a lower accuracy or lack the flexibility needed to generalize them to previously unseen data.
Contribution
In this paper we propose a novel technique, inspired by group testing and Boolean compressed sensing, which yields highly accurate predictions, interpretable results, and is flexible enough to be optimized for various evaluation metrics at the same time.
Results
We test the predictive accuracy of our approach on five first-line and seven second-line antibiotics used for treating tuberculosis. We find that it has a higher or comparable accuracy to that of commonly used machine learning models, and is able to identify variants in genes with previously reported association to drug resistance. Our method is intrinsically interpretable, and can be customized for different evaluation metrics. Our implementation is available at github.com/hoomanzabeti/INGOT_DR and can be installed from the Python Package Index (PyPI) as ingotdr. This package is also compatible with most of the tools in the Scikit-learn machine learning library.
Background
Drug resistance is the phenomenon by which an infectious organism (also known as pathogen) develops resistance to one or more drugs that are commonly used in treatment [1]. In this paper we focus our attention on Mycobacterium tuberculosis (MTB), the etiological agent of tuberculosis, which is the deadliest single infectious agent in the world today, responsible for over 10 million detected cases and approximately 1.4 million deaths in 2019 alone [2].
The development of resistance to common drugs used in treatment is a serious public health threat, not only in low and middle-income countries, but also in high-income countries where it is particularly problematic in hospital settings [3]. It is estimated that, without the urgent development of novel antimicrobial drugs, the total mortality due to drug resistance will exceed 10 million people a year by 2050, a number exceeding the annual mortality due to cancer today [4].
Existing models for predicting drug resistance from whole-genome sequencing (WGS) data broadly fall into two classes. The first, which we refer to as “catalogue methods,” involves testing the WGS data of an isolate for the presence of point mutations (most often single-nucleotide polymorphisms, or SNPs) associated with known drug resistance. These mutations are typically identified via a microbial genome-wide association study (GWAS) and may be confirmed with a functional genomics study. If at least one previously identified mutation is present, the isolate is declared to be resistant [5,6,7,8,9]. While these methods are simple to understand and apply, they often suffer from poor predictive accuracy [10], especially in identifying new resistance mechanisms or predicting resistance to rarely used drugs.
The second class, which we refer to as “machine learning methods”, seeks to infer the drug resistance of an isolate by training complex models directly on WGS and drug susceptibility test (DST) data [11,12,13]. Such methods tend to result in highly accurate predictions at the cost of flexibility and interpretability - specifically, they typically provide only limited, if any, insights into the drug resistance mechanisms involved, and often do not impose explicit limits on the predictive model’s complexity. Learning approaches based on deep neural networks [13, 14] are an example of very accurate but very complex “black-box” models of drug resistance.
In this paper we propose a novel method, based on the group testing problem [15] and Boolean compressed sensing (CS), for the prediction of drug resistance. CS is a mathematical technique for sparse signal recovery from under-determined systems of linear equations [16], and has been successfully applied in many application areas including digital signal processing [17, 18], MRI imaging [19], radar detection [20], and computational uncertainty quantification [21, 22]. Under a sparsity assumption on the unknown signal vector, it has been shown that CS techniques enable recovery from far fewer measurements than required by the Nyquist–Shannon sampling theorem [23]. Boolean CS is a modification of the CS problem, replacing linear algebra over the real numbers with Boolean algebra over binary numbers [24], which has been successfully applied to various forms of non-adaptive group testing [24,25,26].
Our approach, INterpretable GrOup Testing for Drug Resistance (INGOT-DR), combines the flexibility and interpretability of catalogue methods with the accuracy of machine learning methods. More specifically, INGOT-DR is capable of recovering interpretable rules for predicting drug resistance that both result in a high classification accuracy as well as provide insights into the mechanisms of drug resistance. We compare the performance of INGOT-DR with that of standard and state-of-the-art machine learning and rule-based learning methods which have been previously used for genotype-phenotype prediction on MTB data. These methods are logistic regression (LR) [27], random forests (RF) [28], Support Vector Machines (SVM) [29], and KOVER [30]. The comparison covers the prediction of drug resistance for twelve drugs, of which five are first-line and seven are second-line drugs. INGOT-DR displays a competitive performance while maintaining interpretability, flexibility, and accurately recovering many of the known mechanisms of drug resistance.
Methods
We present our methodology as follows. “Group testing and Boolean compressed sensing” and “From group testing to interpretable classification” introduce the group testing problem and discuss how group testing can be combined with compressed sensing to deliver an interpretable predictive model. “Our approach leads to a refined ILP formulation” introduces substantial modifications to a previously published method, which are needed to produce an accurate and flexible classifier that can be tuned for specific evaluation metrics and tasks. “Optimizing different target metrics such as the sensitivity and the specificity” describes the tuning process required to provide the desired trade-off between sensitivity and specificity in a model’s predictions.
Group testing and Boolean compressed sensing
We frame the problem of predicting drug resistance given sequence data as a group testing problem, originally introduced in [15]. This approach for detecting defective members of a set was motivated by the need to screen a large population of army recruits for syphilis in the United States during World War II. The screening, performed by testing blood samples, was costly due to the low numbers of infected individuals. To make the screening more efficient, Robert Dorfman suggested pooling blood samples into specific groups and testing the groups instead. A positive result for the group would imply the presence of at least one infected member. The problem then becomes one of finding the subset of individuals whose infected status can explain all the positive results without invalidating any of the negative ones.
In this setting, the design matrix encodes the individuals tested in each group, the outcome vector describes the result of each test, and the solution, obtained from a suitable algorithmic procedure, is a \(\{0,1\}\)-valued vector representing the infection status of the individuals [24, 31]. Since the fraction of infected individuals is assumed to be small, the solution vector is sparse and can be recovered with Boolean CS. The importance of this observation lies in the fact that the result of solving the Boolean CS problem can also be interpreted as a sparse set of rules for determining the status of each sample in other data mining contexts [24]. We summarize this correspondence in our context in Table 1 below, and use the context-specific interpretation throughout the rest of this paper.
Mathematically, the problem with m isolates and n SNPs can be described by the Boolean design matrix \(A \in \{0,1\}^{m \times n}\), where \(A_{ij}\) indicates the presence/absence status of SNP j in the i-th isolate, and the Boolean outcome vector \(y\in \{0,1\}^m\), where \(y_i\) represents the drug resistance phenotype of the i-th isolate. Let us define the relevance vector \(w\in \{0,1\}^n\) in such a way that \(w_j = 1\) if and only if the j-th SNP is relevant to drug resistance.
The key assumption is that one or more SNPs relevant to drug resistance can cause the isolate to be drug-resistant, whereas an isolate with no such SNP will be drug-sensitive. This is an assumption commonly made in the literature, and is precisely the same as the key assumption of group testing, which is that the presence of one or more infected individuals leads to a positive test, while a test with no infected individuals comes out negative (we note that these assumptions only hold in the absence of noise). In fact, although our group is the first one, to our knowledge, to make the connection between group testing and drug resistance prediction, a previously published method for this task [32] corresponds almost perfectly to the Definite Defectives algorithm used in group testing [33].
Under this assumption, the outcome vector satisfies the relationship
$$y_i = \bigvee_{j=1}^{n} \left( A_{ij} \wedge w_j \right), \quad 1 \le i \le m,$$
where \(\vee\) and \(\wedge\) are the Boolean OR and AND operators, respectively. Using the definition of Boolean matrix-vector multiplication, this can be equivalently written as
$$y = A \vee w. \qquad (1)$$
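To make the correspondence concrete, the following toy example (with hypothetical data) shows that, over 0/1 data, the Boolean product amounts to checking whether an isolate carries at least one relevant SNP; in NumPy this can be computed by thresholding an ordinary matrix-vector product:

```python
import numpy as np

# Toy example (hypothetical data): 3 isolates x 3 SNPs.
A = np.array([[1, 0, 1],
              [0, 1, 0],
              [0, 0, 0]])
w = np.array([1, 0, 0])  # only SNP 0 is deemed relevant

# Boolean product y = A v w: an isolate is resistant iff it carries
# at least one relevant SNP.
y = (A @ w > 0).astype(int)
print(y)  # [1 0 0]
```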
If the status vector w satisfying Eq. (1) is assumed to be sparse (i.e. there are few relevant SNPs), the problem of finding w becomes an instance of the sparse Boolean vector recovery problem:
$$\min_{w \in \{0,1\}^n} \Vert w \Vert_0 \quad \text{subject to} \quad y = A \vee w, \qquad (2)$$
where \(\Vert w \Vert _0\), called the \(\ell _0\)-norm of w, is the number of non-zero entries it contains.
The combinatorial optimization problem (2) is well-known to be NP-hard [34]. In [24, 35] an equivalent formulation of (2) via 0–1 integer linear programming (ILP) is proposed, in which the \(\ell _0\)-norm is replaced by the convex \(\ell _1\)-norm, equivalent to it over binary vectors, and the Boolean matrix-vector product is replaced with equivalent linear constraints. We recapitulate their formulation here:
$$\begin{aligned} \min_{w} \quad & \sum_{j=1}^{n} w_j \\ \text{subject to} \quad & w_j \in \{0,1\}, \quad 1 \le j \le n, \\ & A_{\mathcal{P}}\, w \ge 1, \\ & A_{\mathcal{Z}}\, w = 0. \end{aligned} \qquad (3)$$
Here, \(\mathcal {P} := \{i : y_i = 1\}\) and \(\mathcal {Z} := \{i : y_i = 0\}\) are the sets of positive (drug-resistant) and negative (drug-sensitive) isolates, respectively, and \(A_S\) denotes the submatrix of A whose row indices are in the given subset S. In this formulation, the objective is to minimize the number of SNPs inferred to be relevant to drug resistance. The first constraint then ensures that each SNP is classified as either relevant or irrelevant, the second one ensures that the drug-resistant isolates have at least one relevant SNP present, and the third one ensures that the drug-sensitive isolates do not have any such SNPs, in line with our key assumption. This NP-hard problem formulation can further be made tractable for linear programming by relaxing the Boolean constraint on w in (3) to \(0 \le w_j \le 1\) for all \(1 \le j \le n\) [24].
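For illustration, this noiseless formulation can be written down directly in PuLP, the library we later use to interface with CPLEX (see “Implementation”). This is a minimal sketch under stated assumptions: A and y are NumPy arrays, the default CBC solver stands in for CPLEX, and the function name is ours rather than part of the INGOT-DR package.

```python
import numpy as np
import pulp

def solve_noiseless_ilp(A, y):
    """Sketch of ILP (3): minimize the number of relevant SNPs so that every
    resistant isolate carries at least one and no sensitive isolate does."""
    m, n = A.shape
    P = np.where(y == 1)[0]  # drug-resistant isolates
    Z = np.where(y == 0)[0]  # drug-sensitive isolates

    prob = pulp.LpProblem("boolean_cs_ilp", pulp.LpMinimize)
    w = [pulp.LpVariable(f"w_{j}", cat="Binary") for j in range(n)]

    prob += pulp.lpSum(w)  # objective: number of selected SNPs
    for i in P:
        prob += pulp.lpSum(int(A[i, j]) * w[j] for j in range(n)) >= 1
    for i in Z:
        prob += pulp.lpSum(int(A[i, j]) * w[j] for j in range(n)) == 0

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return np.array([int(round(v.value())) for v in w])
```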
Because the Boolean CS problem is based on Boolean algebra, the conditions on the Boolean matrices A that guarantee exact recovery of k-sparse status vectors (vectors with at most k 1’s) via such linear programming relaxations are quite stringent, and differ from those of standard CS. Specifically, these guarantees require the matrix A to be k-disjunct, i.e. the Boolean sum of any set of at most k of its columns must not be component-wise greater than or equal to any other column. As we have no control over A in our setting, no such recovery guarantees can be provided.
In [24], the combinatorial problem (3) is augmented with slack variables and a regularization term to trade off between the sparsity of w on the one hand, and the discrepancy between the predicted and the actual outcome vector on the other hand. With these modifications, the formulation becomes:
$$\begin{aligned} \min_{w,\,\xi} \quad & \sum_{j=1}^{n} w_j + \lambda \sum_{i=1}^{m} \xi_i \qquad (\text{4a}) \\ \text{subject to} \quad & w_j \in \{0,1\}, \quad 1 \le j \le n, \qquad (\text{4b}) \\ & \xi_i \in \mathbb{Z}_{\ge 0}, \quad i \in \mathcal{P}, \qquad (\text{4c}) \\ & \xi_i \in \mathbb{Z}_{\ge 0}, \quad i \in \mathcal{Z}, \qquad (\text{4d}) \\ & A_{\mathcal{P}}\, w + \xi_{\mathcal{P}} \ge 1, \qquad (\text{4e}) \\ & A_{\mathcal{Z}}\, w - \xi_{\mathcal{Z}} = 0, \qquad (\text{4f}) \end{aligned}$$
where \(\lambda > 0\) is a regularization parameter and \(\xi\) is the so-called slack vector. Taking this formulation as a starting point, we introduce several refinements in “Our approach leads to a refined ILP formulation”.
From group testing to interpretable classification
As described in the previous section, the solution to the ILP (4) can be seen as an interpretable rule-based classifier in contexts beyond group testing. The status vector w naturally encodes the following rule: If any feature f with \(w_f = 1\) is present in the sample, classify it as positive; otherwise, classify it as negative. More formally, assume that we have a labelled dataset \(\mathcal {D}=\{(x_1,y_1),\dots ,(x_m,y_m)\}\), where the \(x_i \in \mathcal {X} := \{0,1\}^n\) are n-dimensional binary feature vectors and the \(y_i \in \{0,1\}\) are the binary labels. The feature matrix A is defined via \(A_{ij} = (x_i)_j\) (the j-th component of the i-th feature vector). If \(\hat{w}\) is the solution of ILP (4) for this matrix A and the outcome vector \(y=(y_i)_{i=1}^m\), we define the classifier \(\hat{c}: \mathcal {X} \rightarrow \{0,1\}\) via
$$\hat{c}(x) = \bigvee_{j=1}^{n} \left( x_j \wedge \hat{w}_j \right). \qquad (5)$$
What makes this classifier interpretable is that it explicitly depends on the presence or absence of specific features in its input, while ignoring all the other features.
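In code, applying such a rule amounts to checking whether any selected feature is present in the sample; a minimal sketch (the function name is ours, not part of the package):

```python
import numpy as np

def predict_disjunctive(w_hat, X):
    """Apply the rule encoded by w_hat: a sample (row of the binary matrix X)
    is classified as positive iff it carries at least one selected feature."""
    return (X @ w_hat > 0).astype(int)
```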
Our approach leads to a refined ILP formulation
The formulation of the ILP (4) is designed to provide, via the parameter \(\lambda\), a trade-off between the sparsity of a rule and the total slack, a quantity that resembles (but does not equal) the training error. We now describe a refinement of this formulation that directly encodes the different types of error, which provides more flexibility during the training process by allowing us to optimize a more precise objective function that is particularly suitable to the application at hand.
As was done in the previous section, we assume that \(\hat{c}\) is the binary classifier obtained by training with a Boolean feature matrix A and its corresponding label vector y. We further refer to a misclassified training sample as a false negative if it has label 1 (is in \(\mathcal {P}\)), and as a false positive if it has label 0 (is in \(\mathcal {Z}\)). In the drug resistance setting, a false negative would mean that we incorrectly predict a drug-resistant isolate to be drug-sensitive, while a false positive would mean that we predict a drug-sensitive isolate to be drug-resistant.
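For clarity, the two error types can be counted directly from the training labels and predictions; a short sketch (names are ours):

```python
import numpy as np

def error_counts(y_true, y_pred):
    """Count the two error types.
    FN: resistant isolates (label 1) predicted sensitive.
    FP: sensitive isolates (label 0) predicted resistant."""
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    return fn, fp
```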
We begin by noting that in the ILP (4), each entry of \(\xi _\mathcal {P}\) must take on a value of 0 or 1, and a value of 1 corresponds to a false negative for \(\hat{c}\). This follows from the fact that A is a binary matrix and w is a binary vector, so the optimal \(\xi _\mathcal {P}\) is also a binary vector (since \(\lambda > 0\)), and therefore
$$\sum_{i \in \mathcal{P}} \xi_i = \text{FN}, \qquad (6)$$
where we use FN to denote the number of false negatives.
However, \(\xi _\mathcal {Z}\) in the ILP (4) can take on integer values greater than 1 corresponding to false positives for \(\hat{c}\). To be able to express the number of false positives, denoted FP, we modify the constraints (4d) and (4f) by also setting
$$\xi_i \in \{0,1\} \quad \text{for all } i \in \mathcal{Z}, \qquad (7)$$
and replacing the equality constraint \(A_{\mathcal {Z}}w-\xi _{\mathcal {Z}} =0\) with the tighter inequality
$$A_i\, w \le \alpha_i\, \xi_i \quad \text{for all } i \in \mathcal{Z}, \qquad (8)$$
where \(\alpha _i =\sum _{j=1}^n A_{ij}\) and \(A_i\) is the ith row of A.
After these modifications, (8) ensures that \(\xi _{i}=1\) if \(A_{i}w > 0\), while the presence of \(\xi _{\mathcal {Z}}\) in the objective function, with \(\lambda > 0\), ensures that \(\xi _{i}=0\) if \(A_{i}w = 0\), for any \(i \in \mathcal {Z}\). We now also get
$$\sum_{i \in \mathcal{Z}} \xi_i = \text{FP}. \qquad (10)$$
To provide additional flexibility for situations where false positives and false negatives are valued differently, we further split the regularization term into two: one for the positive class \(\mathcal {P}\), and one for the negative class \(\mathcal {Z}\):
$$\lambda \sum_{i=1}^{m} \xi_i \;\;\longrightarrow\;\; \lambda_{\mathcal{P}} \sum_{i \in \mathcal{P}} \xi_i + \lambda_{\mathcal{Z}} \sum_{i \in \mathcal{Z}} \xi_i.$$
The general form of the new ILP is now as follows:
$$\begin{aligned} \min_{w,\,\xi} \quad & \sum_{j=1}^{n} w_j + \lambda_{\mathcal{P}} \sum_{i \in \mathcal{P}} \xi_i + \lambda_{\mathcal{Z}} \sum_{i \in \mathcal{Z}} \xi_i \\ \text{subject to} \quad & A_{\mathcal{P}}\, w + \xi_{\mathcal{P}} \ge 1, \\ & A_i\, w \le \alpha_i\, \xi_i, \quad i \in \mathcal{Z}, \\ & w_j \in \{0,1\}, \quad 1 \le j \le n, \\ & \xi_i \in \{0,1\}, \quad 1 \le i \le m. \end{aligned} \qquad (11)$$
In this new formulation, \(\lambda _\mathcal {P}\) and \(\lambda _\mathcal {Z}\) control the trade-off between the false positives and the false negatives, and jointly influence the sparsity of the rule. In the following section we describe how this formulation can be further tailored to optimize different evaluation metrics, such as the sensitivity and the specificity of the predictor.
Optimizing different target metrics such as the sensitivity and the specificity
Since the ILP formulation in (11) provides us with direct access to the two components of the training error as well as the sparsity (rule size), we may modify the classifier to optimize a variety of target metrics by transforming some of the objective function components into constraints and optimizing the remaining ones.
For instance, assume that we would like to train the classifier \(\hat{c}\) to maximize the sensitivity at a given minimum specificity \(\bar{t}\) and maximum rule size k. Recall that
$$\text{Specificity} = \frac{\text{TN}}{\text{TN}+\text{FP}} \qquad (12)$$
and
$$\text{Sensitivity} = \frac{\text{TP}}{\text{TP}+\text{FN}}. \qquad (13)$$
From Eqs. (10), (12) and the definition of \(\mathcal {Z}\), we get the constraint
$$\sum_{i \in \mathcal{Z}} \xi_i \le (1-\bar{t})\, |\mathcal{Z}|. \qquad (14)$$
Also, to restrict the maximum rule size to k we can use the constraint
$$\sum_{j=1}^{n} w_j \le k. \qquad (15)$$
Our objective is to maximize the sensitivity, which is equivalent to minimizing \(\sum _{i\in \mathcal {P}} \xi _i\) by Eqs. (13) and (6). In addition, by incorporating Eqs. (14) and (15), the ILP (11) can be modified as follows:
$$\begin{aligned} \min_{w,\,\xi} \quad & \sum_{i \in \mathcal{P}} \xi_i \\ \text{subject to} \quad & A_{\mathcal{P}}\, w + \xi_{\mathcal{P}} \ge 1, \\ & A_i\, w \le \alpha_i\, \xi_i, \quad i \in \mathcal{Z}, \\ & \sum_{i \in \mathcal{Z}} \xi_i \le (1-\bar{t})\, |\mathcal{Z}|, \\ & \sum_{j=1}^{n} w_j \le k, \\ & w_j \in \{0,1\}, \quad 1 \le j \le n, \\ & \xi_i \in \{0,1\}, \quad 1 \le i \le m. \end{aligned} \qquad (16)$$
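This constrained formulation can again be sketched in PuLP. As before, this is an illustrative sketch rather than the packaged implementation: A and y are assumed to be NumPy arrays, the default CBC solver stands in for CPLEX, and the function and argument names are ours.

```python
import numpy as np
import pulp

def train_rule(A, y, min_specificity=0.9, max_rule_size=20):
    """Sketch of the constrained ILP: maximize training sensitivity subject to
    a minimum training specificity and a maximum rule size."""
    m, n = A.shape
    P = np.where(y == 1)[0]  # drug-resistant isolates
    Z = np.where(y == 0)[0]  # drug-sensitive isolates

    prob = pulp.LpProblem("constrained_rule_ilp", pulp.LpMinimize)
    w = [pulp.LpVariable(f"w_{j}", cat="Binary") for j in range(n)]
    xi = [pulp.LpVariable(f"xi_{i}", cat="Binary") for i in range(m)]

    # Minimizing the slack over resistant isolates minimizes false negatives,
    # i.e. maximizes sensitivity.
    prob += pulp.lpSum(xi[i] for i in P)

    for i in P:  # xi_i = 1 if resistant isolate i carries no selected SNP
        prob += pulp.lpSum(int(A[i, j]) * w[j] for j in range(n)) + xi[i] >= 1
    for i in Z:  # xi_i is forced to 1 if sensitive isolate i carries a selected SNP
        alpha_i = int(A[i].sum())
        prob += pulp.lpSum(int(A[i, j]) * w[j] for j in range(n)) <= alpha_i * xi[i]

    # Training specificity of at least min_specificity: FP <= (1 - t) * |Z|
    prob += pulp.lpSum(xi[i] for i in Z) <= (1 - min_specificity) * len(Z)
    # At most max_rule_size SNPs in the rule
    prob += pulp.lpSum(w) <= max_rule_size

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return np.array([int(round(v.value())) for v in w])
```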
The maximum specificity at given sensitivity and rule size can be found analogously. In a similar way, one can minimize a weighted average of rule size and false positive rate at a given maximum false negative rate (minimum sensitivity), or vice versa.
Implementation
Existing methods used for comparison with INGOT-DR
To ensure a fair comparison, we include three machine learning methods widely used for drug resistance prediction: random forests (RF) [28], logistic regression (LR) [27], and support vector machines (SVM) [29]. The use of RF is motivated by its flexibility and its many successful applications in computational biology and genomics [36, 37]. The use of LR is based on its excellent performance in drug resistance prediction for MTB in comparison to other methods [38]. The use of SVM is motivated by its excellent performance in a comparison of drug resistance prediction for multiple bacterial pathogens [30]; we use it with a linear kernel for simplicity, although other kernels are often used [39]. For LR and SVM, we consider the \(\ell _1\) and \(\ell _2\) regularizations, which correspond to penalizing the sum of the absolute values and the Euclidean norm of the coefficients, respectively.
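For reference, the corresponding Scikit-learn estimators can be instantiated as follows; the regularization strengths and forest size shown are placeholders, since the actual values are selected by cross-validation as described below:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Baseline models with l1/l2 regularization (hyper-parameters are placeholders).
baselines = {
    "RF": RandomForestClassifier(n_estimators=500, random_state=0),
    "LR-l1": LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
    "LR-l2": LogisticRegression(penalty="l2", solver="lbfgs", C=1.0, max_iter=1000),
    "SVM-l1": LinearSVC(penalty="l1", dual=False, C=1.0),
    "SVM-l2": LinearSVC(penalty="l2", C=1.0),
}
```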
We also use KOVER [30], to our knowledge the only other interpretable machine learning method for drug resistance prediction. All the methods except KOVER are implemented in the Python programming language [40]. Although KOVER can produce rule-based classifiers using two algorithms, Classification and Regression Trees (CART) and the Set Covering Machine (SCM), we only consider the latter, as it is the main innovation of KOVER [41] and the two algorithms yield very similar accuracy [30]. We use the Scikit-learn [42] implementations of the machine learning models: RandomForestClassifier for RF, LogisticRegression for LR, and LinearSVC for SVM. We use KOVER version 2.0 [43], and solve the ILPs in INGOT-DR with the CPLEX optimizer, version 12.10.0 [44], accessed through the PuLP API [45, 46].
Data
We combine data from the Pathosystems Resource Integration Center (PATRIC) [47] and the Relational Sequencing TB Data Platform (ReSeqTB) [48]. This results in 8000 isolates together with their resistant/susceptible status for twelve drugs, including five first-line (rifampicin, isoniazid, pyrazinamide, ethambutol, and streptomycin) and seven second-line drugs (kanamycin, amikacin, capreomycin, ofloxacin, moxifloxacin, ciprofloxacin, and ethionamide) [49, 50]. The whole-genome sequencing data for these 8000 isolates, in the form of paired FASTQ files, are downloaded from the European Nucleotide Archive [51] and the Sequence Read Archive [52]. The accession numbers used to obtain the data in our study are: ERP[000192, 000520, 006989, 008667, 010209, 013054], PRJEB[10385, 10950, 14199, 2358, 2794, 5162, 9680], PRJNA[183624, 235615, 296471], and SRP[018402, 051584, 061066].
In order to transform the raw sequencing data into variant calls, we use a pipeline similar to that used in previous work [50, 53]. We use the BWA software [54], specifically, the BWA-MEM program, for the mapping. We then call the single-nucleotide polymorphisms (SNPs) of each isolate with two different pipelines, SAMtools [55] and GATK [56], and take the intersection of their calls to ensure reliability. The final dataset, which includes the position as well as the reference and alternative allele for each SNP [50], is used as the input to our machine learning tools.
Starting from this input we create a binary feature matrix as described in “From group testing to interpretable classification”. For each drug, we only consider the isolates with a known status for this drug. We group all the SNPs in perfect linkage disequilibrium (LD) [57], i.e. sharing identical presence/absence patterns in those isolates, into a single feature that we call a SNP group. This representation does not affect the predictive accuracy of any of the machine learning methods, but helps create a consistent feature importance score for the non-interpretable ones. In KOVER, at most one SNP in a SNP group can be selected to be part of a rule, and the remaining SNPs in the group are labelled equivalent [41]; we adopt this convention here. The number of labelled and drug-resistant isolates, as well as the number of SNPs and SNP groups for each drug, is shown in Table 2.
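Grouping SNPs in perfect LD amounts to collapsing identical columns of the binary feature matrix; a small NumPy sketch (the function name is ours):

```python
import numpy as np

def group_snps_in_perfect_ld(A):
    """Collapse SNPs (columns of the binary matrix A) that share identical
    presence/absence patterns into single SNP-group features."""
    _, first_idx, inverse = np.unique(A, axis=1, return_index=True,
                                      return_inverse=True)
    A_grouped = A[:, first_idx]  # one representative column per group
    groups = [np.where(inverse == g)[0] for g in range(len(first_idx))]
    return A_grouped, groups
```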
Splitting the data into a training and testing set; tuning the hyper-parameters
To evaluate our classifier we use a random stratified train-test split, where the training set contains \(80\%\) and the testing set contains \(20\%\) of the data. For hyper-parameter tuning, Scikit-learn provides two main approaches: grid search and randomized search cross-validation. KOVER is also equipped with two tuning techniques, K-fold cross-validation and risk bound selection. To make the comparison as consistent as possible, we use 5-fold cross-validation for KOVER and grid search with 5-fold cross-validation for all the other models. During cross-validation, balanced accuracy is used as the model selection metric for all the models except KOVER; to the best of our knowledge, KOVER does not provide the option to change the model selection metric.
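A minimal Scikit-learn sketch of this protocol for one of the baseline models, assuming X and y hold the binary feature matrix and phenotype labels and using an illustrative parameter grid:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Stratified 80/20 split, then grid search with 5-fold cross-validation,
# using balanced accuracy for model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

search = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
    cv=5,
    scoring="balanced_accuracy",
)
search.fit(X_train, y_train)
```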
Evaluating the models’ performance
Evaluating the performance of an interpretable predictive model can be challenging. While most evaluation methods focus on predictive accuracy, it is essential to assess the model’s interpretability as well. Although there is no consensus definition of interpretability, [58] suggest that an interpretable method should provide acceptable predictive accuracy while being easy to understand and providing meaningful insights to its audience. Adopting their idea, we evaluate the performance of our approach and the competitor methods using three metrics:
1. Predictive accuracy, measured via the balanced accuracy,
$$\text {Balanced Accuracy} = \frac{\text {Sensitivity}+\text {Specificity}}{2}$$
2. Simplicity, measured via the number of features (SNPs) in the trained model.
3. Insight generation, measured via the relevance of the selected SNPs to known drug resistance mechanisms.
This evaluation process is demonstrated in detail in “The comparison between interpretable and non-interpretable models” and “Results”.
The comparison between interpretable and non-interpretable models
The overall pipeline consists of SNP calling and SNP grouping as described in “Data”, hyper-parameter tuning as described in “Splitting the data into a training and testing set; tuning the hyper-parameters”, and model training and testing using the balanced accuracy as the metric as described in “Evaluating the models’ performance”. This addresses the first evaluation criterion, the predictive accuracy.
To evaluate model simplicity, we investigate the SNPs selected by each model. For the rule-based classifiers, we ensure a low model complexity, and therefore a higher interpretability, by training both INGOT-DR and KOVER with the same maximum allowed rule size (number of SNPs used), k. By default, INGOT-DR also has a (training) specificity lower bound of \(\bar{t}=90\%\), via the constraint explained in “Optimizing different target metrics such as the sensitivity and the specificity”. We evaluate the simplicity of the remaining models by counting the SNPs with non-zero coefficients for LR and SVM, and the SNPs with a non-zero importance according to Scikit-learn for RF.
Lastly, to evaluate and fairly compare the models’ ability to generate insights, we compare the top k most important SNPs for each one [59]. For both INGOT-DR and KOVER, we simply evaluate the k or fewer SNPs used in each rule. Since the other machine learning methods are not inherently interpretable, we extract the SNP importance values using the Shapley additive explanation (SHAP) algorithm [60], a model-agnostic method for making explainable predictions rooted in game theory. This algorithm, implemented in the SHAP Python package, version 0.37.0 [61], provides the guaranteed unique solution satisfying three fairness conditions. We apply TreeExplainer for RF and LinearExplainer for LR and SVM, and select the k SNPs with the highest importance. We use \(k = 20\) in all our experiments.
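A sketch of the SHAP-based importance extraction, assuming rf and lr are fitted Scikit-learn models and X_train is the binary training matrix (variable names are ours):

```python
import numpy as np
import shap

# rf and lr are fitted scikit-learn models; X_train is the binary feature matrix.
tree_explainer = shap.TreeExplainer(rf)
rf_shap = tree_explainer.shap_values(X_train)   # for a binary RF: one array per class

linear_explainer = shap.LinearExplainer(lr, X_train)
lr_shap = linear_explainer.shap_values(X_train)

# Rank SNPs by mean absolute SHAP value and keep the top k = 20.
k = 20
importance = np.abs(lr_shap).mean(axis=0)
top_k_snps = np.argsort(importance)[::-1][:k]
```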
Results
INGOT-DR produces accurate predictive models
The performance of INGOT-DR compared to that of the other methods in terms of the balanced accuracy is summarized in Table 3, and Fig. 1 separately shows the sensitivity and specificity. Overall, INGOT-DR outperforms all other models on 4/12 drugs, ties with KOVER for the best performance on one additional drug, and achieves a balanced accuracy within \(5\%\) of the best one for the remaining 7/12 drugs. SVM-l1 achieves the best balanced accuracy on 4/12 drugs, while LR-l1 and KOVER obtain the best balanced accuracy on 2/12 drugs each. Furthermore, INGOT-DR outperforms RF on 12/12 drugs, KOVER, LR-l2, and SVM-l2 on 9/12 drugs, and LR-l1 on 8/12 drugs. SVM-l1 is the only closely competitive model: INGOT-DR outperforms it on just 5/12 drugs, although INGOT-DR does obtain a marginally better balanced accuracy on average (85.7% vs. 85.3%).
INGOT-DR produces interpretable models
INGOT-DR produces predictive models in the form of disjunctive (logical-OR) rules over the presence of specific SNPs, as explained in “From group testing to interpretable classification”. These models are easy to understand and interpret. Although KOVER considers rules containing both presence and absence of features [30], the absence of a SNP is harder to interpret in the context of genomics, so we only focus on the presence of SNPs here. We note that, by De Morgan’s law, both methods could produce conjunctive (logical-AND) rules by training the model on the complement of the feature matrix, \(\bar{A}\), and outcome vector, \(\bar{y}\); however, we focus on disjunctive rules in this paper.
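Concretely, a conjunctive rule can be obtained by reusing the constrained training sketch from the Methods section (train_rule, a name we introduced there) on the complemented data; this is an illustration, not a packaged feature:

```python
# By De Morgan's law, a disjunctive rule learned on the complemented data
# (1 - A, 1 - y) corresponds to a conjunctive (logical-AND) rule on (A, y):
# an isolate is then predicted resistant iff all selected SNPs are present.
A_bar = 1 - A
y_bar = 1 - y
w_and = train_rule(A_bar, y_bar, min_specificity=0.9, max_rule_size=20)
```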
We display the number of SNPs used by the predictive models produced by each method in Table 4. These results, combined with those of the previous section, suggest that INGOT-DR is producing the most interpretable models without sacrificing predictive accuracy. Although KOVER almost always produces shorter rules, they tend to not generalize as well to the testing dataset.
For a specific example, we consider the most concise model produced by INGOT-DR—the one for ciprofloxacin, a drug in the fluoroquinolone family. This model has a rule size of 5, and the SNPs used are all in the gyrA gene, known to be involved in the resistance to fluoroquinolones such as ciprofloxacin in bacteria [62]. In this example, INGOT-DR not only identifies the correct gene, but also selects mutations that are known to be associated with fluoroquinolone resistance in MTB—the selected codons, 90, 91 and 94, are among the codons most strongly associated with this type of resistance [63]. We state the rule obtained by INGOT-DR below, in a standard format specifying the gene, the original amino acid, the codon number, and the mutated amino acid.
INGOT-DR selects many SNPs in genes previously associated with drug resistance
Our results demonstrate that the models produced by INGOT-DR contain many SNPs in genes previously associated with drug resistance in MTB. This suggests that INGOT-DR not only makes accurate predictions, but that it makes them for the right reason, and could thus also be used to prioritize hypotheses about the mechanisms associated with drug resistance.
Figure 2 shows, for each of the models, the top \(k \le 20\) most important SNPs, defined as all the SNPs included in a rule by KOVER and INGOT-DR, and the top k SNPs by feature importance as defined by SHAP for the other models. We categorize each SNP according to the known information about its association with resistance to the drug of interest in MTB. This categorization is based on a list of 183 genes and 19 promoter regions selected out of over 4000 MTB genes through a data-driven and consensus-driven process by a panel of experts [64]. We use the following categories:
1. Drug specific association: SNP in a gene or intergenic region associated with drug resistance to the drug of interest;
2. Known association: SNP in a gene or intergenic region associated with drug resistance to any other drug;
3. Unknown association: SNP in a gene not known to be associated with drug resistance to any drug;
4. Intergenic association: SNP in an intergenic region not known to be associated with drug resistance to any drug.
We note that for the purposes of this categorization, whenever a group of SNPs in perfect LD was selected by a model, it was categorized according to the highest (lowest-numbered) category of any of the SNPs contained in the group. However, very few such SNP groups were selected by any of the models, and the vast majority of those that were selected contained SNPs within the same gene.
A comparison between the methods based on Fig. 2 suggests that INGOT-DR and KOVER detect more SNPs in regions known to be associated with drug resistance than all the other methods, with INGOT-DR detecting slightly more such SNPs than KOVER on average, even after adjusting for the slightly more concise rules produced by KOVER relative to INGOT-DR. However, with the exception of the most common first-line drugs (top row) and the three fluoroquinolones (bottom row), even the interpretable methods tend to select more SNPs in parts of the genome not known to be associated with drug resistance, suggesting the potentially important effects of population structure in MTB.
Conclusion
In this paper, we introduced a new approach for creating rule-based classifiers. Our method, INGOT-DR, utilizes techniques from group testing and Boolean compressed sensing, and leverages a 0–1 ILP formulation. It produces classifiers that combine high accuracy with interpretability, and are flexible enough to be tailored for specific evaluation metrics.
We used INGOT-DR to produce classifiers for predicting drug resistance in MTB, by setting a minimum specificity of 90% and a maximum rule size of 20. We tested the classifiers’ predictive accuracy on a variety of antibiotics commonly used for treating tuberculosis, including five first-line and seven second-line drugs. We showed that INGOT-DR produces classifiers with a balanced accuracy exceeding that of other state-of-the-art rule-based and machine learning methods. In addition, we showed that INGOT-DR produces accurate models with a rule size small enough to keep the model understandable for human users. Finally, we showed that our approach generates insights by successfully identifying SNPs associated with drug resistance, as we ascertained on the specific example of ciprofloxacin.
We note that the presence of SNPs in perfect linkage disequilibrium (LD) [57], i.e. sharing identical presence/absence patterns, is common in bacteria such as MTB whose evolution is primarily clonal [65]. For this reason, while the grouping of such SNPs substantially simplifies the computational task at hand and makes it tractable, ascertaining the exact representative of each group to be selected to predict the drug resistance status of an isolate remains difficult. The presence of clonal structure within bacterial populations is a key challenge for the prediction of drug resistance, which we plan to address in future work.
In conclusion, our work has introduced a novel method, INGOT-DR, based on group testing techniques, for producing interpretable models of drug resistance, which demonstrated a state-of-the-art accuracy, descriptive ability, and relevance on an MTB dataset. In future work, we plan to address the challenges of population structure and to extend this framework to other bacteria as well as to less frequently used antimicrobial drugs. We expect our method to become a key part of the drug resistance prediction toolkit for clinical and public health microbiology researchers.
Availability of data and materials
The data used in this study is freely available from the ENA and the NCBI. The code is freely available on GitHub.
Abbreviations
- CS: Compressed sensing
- FN: False negatives
- FP: False positives
- GWAS: Genome-wide association study
- ILP: Integer linear programming
- LD: Linkage disequilibrium
- LR: Logistic regression
- MTB: Mycobacterium tuberculosis
- RF: Random forests
- SCM: Set covering machine
- SNP: Single-nucleotide polymorphism
- SVM: Support vector machine
- WGS: Whole-genome sequencing
References
WHO. Antimicrobial resistance: global report on surveillance. Technical report. WHO. 2014.
WHO. Global tuberculosis report 2020. Technical report. WHO. 2020.
Raviglione MC, Smith IM. XDR tuberculosis—implications for global public health. N Engl J Med. 2007;356(7):656–9.
O’Neill J. Antimicrobial resistance: tackling a crisis for the health and wealth of nations. Review on Antimicrobial Resistance. Technical report; 2014.
Steiner A, Stucki D, Coscolla M, Borrell S, Gagneux S. KvarQ: targeted and direct variant calling from fastq reads of bacterial genomes. BMC Genom. 2014;15:1–12.
Coll F, McNerney R, Preston M, et al. Rapid determination of anti-tuberculosis drug resistance from whole-genome sequences. Genome Med. 2015;7:51.
Bradley P, Gordon N, Walker T, et al. Rapid antibiotic-resistance predictions from genome sequence data for Staphylococcus aureus and Mycobacterium tuberculosis. Nat Commun. 2015;6:1–15.
Iwai H, Kato-Miyazawa M, Kirikae T, Miyoshi-Akiyama T. CASTB (the comprehensive analysis server for the Mycobacterium tuberculosis complex): a publicly accessible web server for epidemiological analyses, drug-resistance prediction and phylogenetic comparison of clinical isolates. Tuberculosis. 2015;95:843–4.
Feuerriegel S, Schleusener V, Beckert P, Kohl TA, Miotto P, Cirillo DM, Cabibbe AM, Niemann S, Fellenberg K. PhyResSE: a web tool delineating Mycobacterium tuberculosis antibiotic resistance and lineage from whole-genome sequencing data. J Clin Microbiol. 2015;53(6):1908–14.
Schleusener V, Köser C, Beckert P, et al. Mycobacterium tuberculosis resistance prediction and lineage classification from genome sequencing: comparison of automated analysis tools. Sci Rep. 2017;7:1–9.
Yang Y, Niehaus KE, Walker TM, Iqbal Z, Walker AS, Wilson DJ, Peto TE, Crook DW, Smith EG, Zhu T, et al. Machine learning for classifying tuberculosis drug-resistance from DNA sequencing data. Bioinformatics. 2018;34(10):1666–71.
Drăghici S, Potter RB. Predicting HIV drug resistance with neural networks. Bioinformatics. 2003;19(1):98–107.
Arango-Argoty G, Garner E, Pruden A, Heath LS, Vikesland P, Zhang L. DeepARG: a deep learning approach for predicting antibiotic resistance genes from metagenomic data. Microbiome. 2018;6(1):1–15.
Chen ML, Doddi A, Royer J, Freschi L, Schito M, Ezewudo M, Kohane IS, Beam A, Farhat M. Beyond multidrug resistance: leveraging rare variants with machine and statistical learning models in Mycobacterium tuberculosis resistance prediction. EBioMedicine. 2019;43:356–69.
Dorfman R. The detection of defective members of large populations. Ann Math Stat. 1943;14(4):436–40.
Foucart S, Rauhut H. A mathematical introduction to compressive sensing. In: Applied and numerical harmonic analysis. New York: Springer; 2013. https://books.google.ca/books?id=zb28BAAAQBAJ.
Eldar YC, Kutyniok G. Compressed sensing: theory and applications. Cambridge: Cambridge University Press; 2012.
Duarte MF, Eldar YC. Structured compressed sensing: from theory to applications. IEEE Trans Signal Process. 2011;59(9):4053–85.
Lustig M, Donoho D, Pauly JM. Sparse MRI: the application of compressed sensing for rapid MR imaging. Magn Resonance Med. 2007;58(6):1182–95.
Herman MA, Strohmer T. High-resolution radar via compressed sensing. IEEE Trans Signal Process. 2009;57(6):2275–84.
Mathelin L, Gallivan K. A compressed sensing approach for partial differential equations with random input data. Commun Comput Phys. 2012;12(4):919–54.
Doostan A, Owhadi H. A non-adapted sparse approximation of PDEs with stochastic inputs. J Comput Phys. 2011;230(8):3015–34.
Candes EJ, Wakin MB. An introduction to compressive sampling. IEEE Signal Process Mag. 2008;25(2):21–30.
Malioutov D, Varshney K. Exact rule learning via Boolean compressed sensing. In: International conference on machine learning; 2013. p. 765–73.
Atia GK, Saligrama V. Boolean compressed sensing and noisy group testing. IEEE Trans Inf Theory. 2012;58(3):1880–901.
Aldridge M, Johnson O, Scarlett J, et al. Group testing: an information theory perspective. Found Trends Commun Inf Theory. 2019;15(3–4):196–392.
Doerken S, Avalos M, Lagarde E, Schumacher M. Penalized logistic regression with low prevalence exposures beyond high dimensional settings. PLoS ONE. 2019;14(5):1–14.
Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20:273–97.
Drouin A, Letarte G, Raymond F, Marchand M, Corbeil J, Laviolette F. Interpretable genotype-to-phenotype classifiers with performance guarantees. Sci Rep. 2019;9(1):1–13.
Cohen A, Dahmen W, DeVore R. Compressed sensing and best \(k\)-term approximation. J Am Math Soc. 2009;22(1):211–31.
Walker TM, Kohl TA, Omar SV, Hedge J, Del Ojo Elias C, Bradley P, Iqbal Z, Feuerriegel S, Niehaus KE, Wilson DJ, et al. Whole-genome sequencing for prediction of Mycobacterium tuberculosis drug susceptibility and resistance: a retrospective cohort study. Lancet Infect Dis. 2015;15(10):1193–202.
Aldridge M, Baldassini L, Johnson O. Group testing algorithms: bounds and simulations. IEEE Trans Inf Theory. 2014;60(6):3671–87.
Natarajan BK. Sparse approximate solutions to linear systems. SIAM J Comput. 1995;24(2):227–34.
Malioutov D, Malyutov M. Boolean compressed sensing: LP relaxation for group testing. In: 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP); 2012. p. 3305–8.
Chen X, Ishwaran H. Random forests for genomic data analysis. Genomics. 2012;99(6):323–9.
Basu S, Kumbier K, Brown JB, Yu B. Iterative random forests to discover predictive and stable high-order interactions. Proc Natl Acad Sci. 2018;115(8):1943–8.
Kouchaki S, Yang Y, Walker TM, Sarah Walker A, Wilson DJ, Peto TE, Crook DW, Clifton DA. Application of machine learning techniques to tuberculosis drug resistance analysis. Bioinformatics. 2019;35(13):2276–82.
Boser BE, Guyon IM, Vapnik VN. A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on computational learning theory. COLT ’92. New York: Association for Computing Machinery; 1992. p. 144–52.
van Rossum G. Python tutorial. Technical report CS-R9526, Centrum voor Wiskunde en Informatica (CWI). Amsterdam; 1995.
Drouin A, Giguère S, Déraspe M, Marchand M, Tyers M, Loo VG, Bourgault A-M, Laviolette F, Corbeil J. Predictive computational phenotyping and biomarker discovery using reference-free genome comparisons. BMC Genom. 2016;17(1):754.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
Drouin A. Learn interpretable computational phenotyping models from \(k\)-merized genomic data; 2020. https://github.com/aldro61/kover.
IBM. IBM ILOG CPLEX optimization studio V12.10.0 documentation. International Business Machines Corporation. 2020.
Mitchell S, O’Sullivan M, Dunning I. PuLP: a linear programming toolkit for Python. 2011. http://www.optimization-online.org/DB_FILE/2011/09/3178.pdf.
Lougee-Heimer R. The common optimization interface for operations research: promoting open-source software in the operations research community. IBM J Res Dev. 2003;47(1):57–66. https://doi.org/10.1147/rd.471.0057.
Wattam AR, Abraham D, Dalay O, Disz TL, Driscoll T, Gabbard JL, Gillespie JJ, Gough R, Hix D, Kenyon R, et al. PATRIC, the bacterial bioinformatics database and analysis resource. Nucleic Acids Res. 2014;42(D1):581–91.
Starks AM, Avilés E, Cirillo DM, Denkinger CM, Dolinger DL, Emerson C, Gallarda J, Hanna D, Kim PS, Liwski R, et al. Collaborative effort for a centralized worldwide tuberculosis relational sequencing data platform. Clin Infect Dis. 2015;61(suppl_3):141–6.
Ngo T-M, Teo Y-Y. Genomic prediction of tuberculosis drug-resistance: benchmarking existing databases and prediction algorithms. BMC Bioinform. 2019;20(1):68.
Deelder W, Christakoudi S, Phelan J, Diez Benavente E, Campino S, McNerney R, Palla L, Clark TG. Machine learning predicts accurately Mycobacterium tuberculosis drug resistance from whole genome sequencing data. Front Genet. 2019;10:922.
Leinonen R, Akhtar R, Birney E, Bower L, Cerdeno-Tárraga A, et al. The European nucleotide archive. Nucleic Acids Res. 2011;39:28–31.
Leinonen R, Sugawara H, Shumway M, Collaboration INSD. The sequence read archive. Nucleic Acids Res. 2010;39(suppl_1):19–21.
Coll F, McNerney R, Guerra-Assunção JA, Glynn JR, Perdigão JA, Viveiros M, Portugal I, Pain A, Martin N, Clark TG. A robust SNP barcode for typing Mycobacterium tuberculosis complex strains. Nat Commun. 2014;5:1–5.
Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv; 2013.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.
Poplin R, Ruano-Rubio V, DePristo MA, Fennell TJ, Carneiro MO, der Auwera GAV, Kling DE, Gauthier LD, Levy-Moonshine A, Roazen D, et al. Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv. 2017.
San JE, Baichoo S, Kanzi A, Moosa Y, Lessells R, Fonseca V, Mogaka J, Power R, de Oliveira T. Current affairs of microbial genome-wide association studies: approaches, bottlenecks and analytical pitfalls. Front Microbiol. 2020;10:3119.
Murdoch WJ, Singh C, Kumbier K, Abbasi-Asl R, Yu B. Interpretable machine learning: definitions, methods, and applications. arXiv; 2019.
Saber MM, Shapiro BJ. Benchmarking bacterial genome-wide association study methods using simulated genomes and phenotypes. Microb Genom. 2020;6(3):000337.
Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, editors. Advances in neural information processing systems, vol. 30; 2017. p. 4765–74.
Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, Katz R, Himmelfarb J, Bansal N, Lee S-I. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell. 2020;2(1):2522–5839.
Drlica K, Zhao X. DNA gyrase, topoisomerase IV, and the 4-quinolones. Microbiol Mol Biol Rev. 1997;61(3):377–92.
Avalos E, Catanzaro D, Catanzaro A, Ganiats T, Brodine S, Alcaraz J, Rodwell T. Frequency and geographic distribution of gyrA and gyrB mutations associated with fluoroquinolone resistance in clinical Mycobacterium tuberculosis isolates: a systematic review. PLoS ONE. 2015;10(3):0120470.
Miotto P, Tessema B, Tagliani E, Chindelevitch L, et al. A standardised method for interpreting the association between mutations and phenotypic drug-resistance in Mycobacterium tuberculosis. Eur Respir J. 2017;50(6):170.
Gagneux S. Ecology and evolution of Mycobacterium tuberculosis. Nat Rev Microbiol. 2018;16:202–13.
Acknowledgements
The authors would like to acknowledge Dr. Cedric Chauve, Dr. Ben Adcock and Matthew Nguyen for helpful discussions, and the feedback from Dr. Nicholas Croucher, Dr. John Lees, Dr. Tim Walker, and Dr. Zamin Iqbal.
Funding
This project was funded by a Genome Canada grant, “Machine learning to predict drug resistance in pathogenic bacteria”. LC acknowledges funding from the MRC Centre for Global Infectious Disease Analysis (MR/R015600/1), funded by the UK Medical Research Council (MRC) and the UK Foreign, Commonwealth & Development Office (FCDO) under the MRC/FCDO Concordat agreement, and is part of the EDCTP2 program supported by the EU.
Author information
Authors and Affiliations
Contributions
HZ has conceptualized and implemented the method, carried out the experiments, and analyzed the results. ND has contributed to the conceptualization and implementation. AS and NF have contributed to the data collection, preprocessing and analysis. ML and LC have contributed to conceptualizing the method, and supervised the research. All the authors have additionally contributed to writing or editing the draft and the final version of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
No ethics approval was required for this study.
Consent for publication
All authors have consented to this publication.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Zabeti, H., Dexter, N., Safari, A.H. et al. INGOT-DR: an interpretable classifier for predicting drug resistance in M. tuberculosis. Algorithms Mol Biol 16, 17 (2021). https://doi.org/10.1186/s13015-021-00198-1