1 Introduction and Background

Drug design is a complex process, with costs exceeding $4 billion and a decade of development time required to bring a new drug to market (Schlander et al., 2021). Despite this investment, the vast majority of drug candidates never reach clinical trials, and of those that do, a staggering 90% fail (Sun et al., 2022), with 50% of failures attributed to unexpected human toxicity (Van Norman, 2019). Traditional toxicological studies rely on animal models at the preclinical stage, yet these models face limitations in reliability, time, and ethical concerns, and their translational relevance to humans remains uncertain (Raies and Bajic, 2016).

The adoption of the 3R principles (Replace, Reduce, Refine) to curtail animal testing has catalyzed the development of in vitro methods for toxicological assessment of new compounds (Choudhuri et al., 2018). In the early phases of drug discovery, multiple cytotoxicity assays measure the impact of chemical compounds on cellular structure and function, providing early indications of potential tissue and organ toxicity (Ballantyne, 2006; Tabernilla et al., 2021).

Well-designed in vitro experiments can reduce the reliance on animal testing. An experiment is a systematic procedure aimed at collecting scientific data to test existing hypotheses or generate new ones. Common experimental designs include completely randomized designs and randomized block designs (Festing, 2001).

In contexts such as high-throughput screening (HTS) and toxicity assays, where exhaustive search is infeasible due to the vast number of possible combinations, efficient experimental design is paramount (Niedz and Evens, 2016): it is simply not feasible to test every drug against every target. Bayesian experimental design (BeD) emerges as a powerful tool in this regard, reducing the required number of experiments (Khan et al., 2023). BeD achieves this by proposing the next experiments based on the outcomes of previous ones, thereby potentially curtailing costs and expediting the drug discovery process (Daly et al., 2019; Bader et al., 2023).

In-silico methods are often used in conjunction with in-vitro studies to model the behaviour of biological systems by leveraging available experimental data (Merino-Casallo et al., 2018; Abd El Hafez et al., 2022). Bayesian methods have been applied to selecting the optimal parameters of in-vitro experiments (Pauwels et al., 2014; Johnston et al., 2016), parameter estimation in mechanistic models (Merino-Casallo et al., 2018; Demetriades et al., 2022), estimating drug synergies (Cremaschi et al., 2019; Rønneberg et al., 2021), and computing in-vitro dose-response curves (Hennessey et al., 2010). However, it remains unclear which experiment to conduct next to obtain the most informative data point for the subsequent training iteration, with the aim of enhancing the overall performance of these in-silico models. Addressing this challenge, we borrow Bayesian methods for experimental design from the computer vision community and apply them to modelling toxicity endpoints.

Fig. 1. Mean average precision across the 12 tasks of the Tox21 dataset. Active learning with pretrained BERT features outperforms models trained on ECFP. Furthermore, the BALD and EPIG acquisition functions select more informative samples than uniform (random) sampling, with EPIG demonstrating a slight edge over BALD

2 Methods

2.1 Bayesian Active Learning

We first consider fully supervised learning tasks, e.g., estimating molecular properties, using a probabilistic model with likelihood function \(p(y|\boldsymbol{x}, \phi )\), where \(\boldsymbol{x}\) is an input, y is the output, and \(\phi \) are the parameters of the model \(f(\boldsymbol{x}; \phi )\), with a prior distribution \(p(\phi )\) and a posterior \(p(\phi |\mathcal {D})\) given a labelled training set \(\mathcal {D}=\{(\boldsymbol{x}_i, y_i)\}_{i=1}^{N}\). In active learning or experimental design (Rainforth et al., 2024), we have access to an additional unlabelled set \(\mathcal {D}_u=\{\boldsymbol{x}_i^u\}_{i=1}^{N_u}\) and select which labels to acquire while training the model \(f(\boldsymbol{x}; \phi )\) by maximizing an acquisition function that captures the expected utility of acquiring the label \(y_s\) of the selected input \(\boldsymbol{x}_s\). The new labelled pair \((\boldsymbol{x}_s^u, y_s)\) is then incorporated into the training set, \(\mathcal {D}=\mathcal {D}\cup \{(\boldsymbol{x}_s^u, y_s)\}\), and the probabilistic model, i.e., the posterior \(p(\phi |\mathcal {D})\), is updated accordingly.
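To make this loop concrete, the following is a minimal sketch of pool-based active learning; `train_fn`, `acquire_fn`, and `oracle` are hypothetical placeholders for posterior updating, the acquisition functions defined below, and the labelling experiment, respectively.

```python
import numpy as np

def active_learning_loop(train_fn, acquire_fn, labelled, pool, oracle, n_rounds=200):
    """Generic pool-based active learning loop (a sketch).

    labelled: list of (x, y) pairs; pool: list of unlabelled inputs.
    """
    for _ in range(n_rounds):
        model = train_fn(labelled)           # update the posterior p(phi | D)
        scores = acquire_fn(model, pool)     # e.g., BALD or EPIG per candidate
        s = int(np.argmax(scores))           # x* = argmax of the acquisition function
        x_s = pool.pop(s)
        labelled.append((x_s, oracle(x_s)))  # query the label y_s
    return train_fn(labelled)
```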

Acquisition Function: BALD One popular acquisition function is Bayesian Active Learning by Disagreement (BALD) (Houlsby et al., 2011), which is the expected information gain, measured by the reduction in Shannon entropy, of the model parameter \(\phi \) from labelling \(\boldsymbol{x}\) across all possible realisations of its label y given by \(p(y|\boldsymbol{x},\mathcal {D})\). Specifically, we have

$$\begin{aligned} \begin{aligned} \text {BALD}(\boldsymbol{x})&=\mathbb {E}_{y\sim p(y|\boldsymbol{x}, \mathcal {D})}\left[ {\textrm{H}}[\phi |\mathcal {D}]-{\textrm{H}}[\phi |\boldsymbol{x},y,\mathcal {D}]\right] \\ &={\textrm{H}}[y|\boldsymbol{x},\mathcal {D}]-\mathbb {E}_{\phi \sim p(\phi |\mathcal {D})}\left[ {\textrm{H}}[y|\boldsymbol{x},\phi ]\right] \end{aligned} \end{aligned}$$
(1)

with the optimal design \(\boldsymbol{x}^{\star }=\mathop {\mathrm {arg\,max}}\limits _{\boldsymbol{x}}\text {BALD}(\boldsymbol{x})\). The first term in BALD measures the total uncertainty on \(\boldsymbol{x}\) while the second term measures its aleatoric uncertainty, i.e., the irreducible uncertainty from observational noise. Therefore, BALD selects \(\boldsymbol{x}\) with the highest epistemic uncertainty, i.e., the reducible uncertainty from the lack of data (Kendall and Gal, 2017).
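For a model yielding Monte Carlo samples of the predictive probability (e.g., the MC-dropout network of Sect. 2.2), BALD for binary classification can be estimated as in the following minimal NumPy sketch.

```python
import numpy as np

def bald_scores(probs):
    """MC estimate of BALD for binary classification.

    probs: array of shape (K, N) holding p(y=1 | x_n, phi_k)
    from K stochastic forward passes.
    Returns one BALD score per candidate, shape (N,).
    """
    eps = 1e-12
    p_mean = probs.mean(axis=0)  # predictive p(y=1 | x, D)
    # Total uncertainty: predictive entropy H[y | x, D]
    h_pred = -(p_mean * np.log(p_mean + eps)
               + (1 - p_mean) * np.log(1 - p_mean + eps))
    # Aleatoric part: expected entropy E_phi[ H[y | x, phi] ]
    h_cond = -(probs * np.log(probs + eps)
               + (1 - probs) * np.log(1 - probs + eps)).mean(axis=0)
    return h_pred - h_cond       # epistemic uncertainty
```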

Acquisition Function: EPIG BALD targets a global uncertainty reduction over the parameter space \(\phi \). However, in most supervised learning tasks, users are interested in improving the model accuracy on a target distribution \(p(\boldsymbol{x}_*)\), e.g., the test set. Therefore, recent work (Smith et al., 2023a) argued that an acquisition function that explicitly reduces the model output uncertainty on random samples from \(p(\boldsymbol{x}_*)\), the Expected Predictive Information Gain (EPIG), is more effective than BALD at improving model performance. Specifically, as discussed and defined in (Smith et al., 2023b), \(\text {EPIG}(\boldsymbol{x})=\)

$$\begin{aligned} \begin{aligned} \mathbb {E}_{p(\boldsymbol{x}_*)}\left[ {\textrm{H}}[y_*|\boldsymbol{x}_*, \mathcal {D}]-\mathbb {E}_{p(y|\boldsymbol{x},\mathcal {D})}\left[ {\textrm{H}}[y_*|\boldsymbol{x}_*, \boldsymbol{x}, y, \mathcal {D}]\right] \right] \end{aligned} \end{aligned}$$
(2)

is the expected reduction of the predictive uncertainty over the target input distribution \(p(\boldsymbol{x}_*)\) from observing the label of \(\boldsymbol{x}\). Intuitively, whereas BALD reduces the parameter uncertainty globally, EPIG only reduces the part of the parameter uncertainty that reduces the model output uncertainty on \(p(\boldsymbol{x}_*)\).
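EPIG can likewise be estimated from MC samples by forming the joint predictive distribution of \((y, y_*)\) across parameter samples and computing their mutual information, averaged over target samples. The sketch below follows the style of the MC estimators in Smith et al. (2023b); the exact estimator there may differ in detail.

```python
import numpy as np

def epig_scores(probs_pool, probs_targ):
    """MC estimate of EPIG for binary classification.

    probs_pool: (K, N) samples of p(y=1 | x_n, phi_k) for pool candidates.
    probs_targ: (K, M) samples of p(y_*=1 | x_m, phi_k) for target draws.
    Returns one EPIG score per pool candidate, shape (N,).
    """
    eps = 1e-12
    K = probs_pool.shape[0]
    p_pool = np.stack([1 - probs_pool, probs_pool], axis=-1)  # (K, N, 2)
    p_targ = np.stack([1 - probs_targ, probs_targ], axis=-1)  # (K, M, 2)
    # Joint predictive p(y, y_*) ~= (1/K) sum_k p(y|phi_k) p(y_*|phi_k)
    joint = np.einsum('kna,kmb->nmab', p_pool, p_targ) / K    # (N, M, 2, 2)
    marg_pool = joint.sum(axis=-1)                            # p(y),   (N, M, 2)
    marg_targ = joint.sum(axis=-2)                            # p(y_*), (N, M, 2)
    # Mutual information I(y; y_*) per (candidate, target) pair
    mi = (joint * (np.log(joint + eps)
                   - np.log(marg_pool[..., :, None] + eps)
                   - np.log(marg_targ[..., None, :] + eps))).sum(axis=(-2, -1))
    return mi.mean(axis=1)  # average over target samples x_* ~ p(x_*)
```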

Semi-supervised Active Learning (SSAL). In the fully supervised scenario, the model \(f(\boldsymbol{x};\phi )\) learns only from the labelled dataset \(\mathcal {D}\). This is inefficient in active learning because the labelled training set is initially small, and active learning has to collect more data to learn a good input manifold, which is required to estimate the uncertainty of downstream tasks (Smith et al., 2024). This is particularly challenging in the chemical space, where the input manifold is nontrivial (Zhou et al., 2019). Therefore, researchers proposed semi-supervised active learning (SSAL) approaches (Zhang et al., 2019; Hao et al., 2020) that learn the representations of input molecules using both labelled and unlabelled data and conduct active learning in the representation space with the labelled data. However, the molecules available in most public molecular property datasets are still limited in number (hundreds to thousands), even without labels.

In this paper, we propose to use molecular representations from a pretrained self-supervised learning model. Specifically, we encoded the molecular SMILES sequences into embeddings using MolBERT, a large transformer model pretrained on 1.6 million SMILES via masked-language modelling alongside physicochemical-property prediction (Fabian et al., 2020). The embedding of each SMILES sequence is the pooled output of the pretrained MolBERT, with dimension 768. We used these embeddings to train a fully connected (i.e., MLP) head. This strategy allowed us to leverage a significant volume of molecular data, which is particularly beneficial when conducting active learning on relatively small datasets.
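The following is a hypothetical sketch of this featurization step; `MolBertFeaturizer`, its `transform` signature, and the checkpoint path are assumptions about the released MolBERT code rather than a verified API.

```python
# Hypothetical featurization with a pretrained MolBERT checkpoint.
from molbert.utils.featurizer.molbert_featurizer import MolBertFeaturizer  # assumed import path

featurizer = MolBertFeaturizer("path/to/molbert_checkpoint.ckpt")  # placeholder path
# One pooled 768-dimensional embedding per SMILES string.
embeddings, valid_mask = featurizer.transform(smiles_list)
```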

Fig. 2. Sum of acquired samples across the 12 tasks of the Tox21 dataset. EPIG and BALD acquire more positive samples than random acquisition

2.2 Practical Bayesian Neural Networks

In this work, we use a Bayesian neural network to account for model uncertainty. According to recent research on dropout variational inference (Gal and Ghahramani, 2016), a practical Bayesian neural network can be obtained for a wide variety of architectures by simply training the network with dropout (MC dropout) and interpreting this as equivalent to variational inference (Blei et al., 2017). Uncertainty is then estimated by performing multiple forward passes with different dropout masks. Although MC dropout often underestimates uncertainty, it has been a popular choice for Bayesian active learning with neural networks and shows promise on real-world datasets (Gal et al., 2017; Rakesh and Jain, 2021).
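In practice, MC dropout requires keeping dropout active at inference while leaving other layers (e.g., batch normalization) in evaluation mode. A minimal PyTorch sketch, where the number of passes is a placeholder:

```python
import torch

def enable_mc_dropout(model: torch.nn.Module):
    """Eval mode overall, but Dropout layers keep resampling masks."""
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()

def mc_dropout_probs(model, x, k=100):
    """K stochastic forward passes; returns a (K, N) tensor of
    p(y=1 | x, phi_k) samples for the BALD/EPIG estimators above."""
    enable_mc_dropout(model)
    with torch.no_grad():
        return torch.stack([torch.sigmoid(model(x)).squeeze(-1)
                            for _ in range(k)])
```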

The network consists of input, hidden, and output layers, where \(\boldsymbol{x}_0\) is initialized as the input features \(\boldsymbol{x}\), which are either BERT features (in semi-supervised AL) or binary fingerprints (in supervised AL). We utilize dropout for uncertainty estimation, batch normalization for training stability, and the rectified linear unit (ReLU) as the default activation function. Additionally, the network incorporates a skip connection merging the input and output of the hidden layer, enhancing information flow. Finally, the output layer generates logits, which can be transformed into probabilities through a sigmoid activation function.

$$\begin{aligned} \begin{aligned} &\boldsymbol{x}_0 = \boldsymbol{x}\quad \texttt {BERT features or ECFP}\\ &\boldsymbol{x}_{\ell } = \text {Dropout}(\text {ReLU}(\text {BatchNorm}(W_\ell \boldsymbol{x}_0 + \textbf{b}_\ell )))\\ &\tilde{\boldsymbol{x}}_{\ell +1} = \text {BatchNorm}(W_{\ell +1} \boldsymbol{x}_{\ell } + \textbf{b}_{\ell +1}) \\ &\boldsymbol{x}_{\ell +1} = \text {Dropout}(\text {ReLU}(\boldsymbol{x}_{\ell } + \tilde{\boldsymbol{x}}_{\ell +1}))\\ &\boldsymbol{x}_{\text {out}} = W_{\ell +2} \boldsymbol{x}_{\ell + 1} + \textbf{b}_{\ell +2} \end{aligned} \end{aligned}$$
(3)

The hyper-parameters of this model are given in Table 1.
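For concreteness, a PyTorch sketch of Eq. (3) follows; the hidden width and dropout rate below are placeholders for the values in Table 1.

```python
import torch
import torch.nn as nn

class MCDropoutNet(nn.Module):
    """Sketch of the network in Eq. (3): one hidden block with a skip
    connection, dropout, and batch normalization."""

    def __init__(self, in_dim, hidden_dim=128, p_drop=0.5):
        super().__init__()
        self.fc_in = nn.Linear(in_dim, hidden_dim)
        self.bn_in = nn.BatchNorm1d(hidden_dim)
        self.fc_hid = nn.Linear(hidden_dim, hidden_dim)
        self.bn_hid = nn.BatchNorm1d(hidden_dim)
        self.drop = nn.Dropout(p_drop)
        self.fc_out = nn.Linear(hidden_dim, 1)

    def forward(self, x0):
        x = self.drop(torch.relu(self.bn_in(self.fc_in(x0))))  # x_l
        x_tilde = self.bn_hid(self.fc_hid(x))                  # x~_{l+1}
        x = self.drop(torch.relu(x + x_tilde))                 # skip connection
        return self.fc_out(x)                                  # logits
```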

3 Experiments

3.1 Dataset

Tox21. The Tox21 (Toxicology in the 21st Century) dataset is a publicly available dataset used in the field of computational toxicology (Richard et al., 2021). It consists of a large collection of chemical compounds, each associated with various toxicity outcomes, typically measured using high-throughput screening assays that evaluate the compounds' potential toxic effects. The dataset provides binary activity labels for \(\approx \) 8000 compounds across 12 different toxicity pathways.
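For reference, the dataset can be obtained through, e.g., DeepChem's MoleculeNet loader; the keyword arguments below are indicative rather than prescriptive.

```python
import deepchem as dc

# Load Tox21 with ECFP features and a scaffold split (12 binary tasks).
tasks, (train, valid, test), _ = dc.molnet.load_tox21(
    featurizer="ECFP", splitter="scaffold")
print(tasks)                                        # 12 assay names, e.g. 'NR-AR', ..., 'SR-p53'
print(train.X.shape, train.y.shape, train.w.shape)  # w == 0 marks missing labels
```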

Table 1. Hyperparameters of the BNN and its training

The Tox21 dataset is widely used as a benchmark for the development of in silico toxicology models. In this dataset, 6.24% of measurements are active (ranging from 2% to 12% per task), 73% are inactive, and 20.56% are missing, as shown in Fig. 3.

Fig. 3. The output space of Tox21, displaying active compounds in red, inactive compounds in blue, and missing data points in white. (Color figure online)

3.2 Data Splitting

Train and Test Sets. For a better evaluation of generalization, we employed scaffold splitting with an 80:20 ratio to create distinct training and testing sets. Scaffold splitting partitions a molecular dataset according to core structural motifs identified by the Bemis-Murcko scaffold representation (Bemis and Murcko, 1996), prioritizing larger groups while ensuring that the train and test sets share no identical scaffolds. The test set is identical across all experiments.
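A minimal sketch of such a split using RDKit's Bemis-Murcko implementation; the greedy largest-group-first assignment is our reading of the procedure, not a verified reproduction.

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_frac=0.2):
    """Group molecules by Bemis-Murcko scaffold, then fill the
    training set with the largest scaffold groups first."""
    groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        groups[MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)].append(i)
    train, test = [], []
    n_train = int((1 - test_frac) * len(smiles_list))
    for idxs in sorted(groups.values(), key=len, reverse=True):
        (train if len(train) + len(idxs) <= n_train else test).extend(idxs)
    return train, test
```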

Initial and Pool Sets. A balanced initial set was constructed by randomly selecting 100 molecules from the training set, with equal representation of positive and negative instances. Subsequently, a pool set was generated by excluding the initial set from the training set.
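A sketch of this initialization, assuming `y_train` holds the binary labels of one task:

```python
import numpy as np

def balanced_initial_set(y_train, n_init=100, seed=0):
    """Sample n_init molecules with equal positives and negatives;
    the remainder of the training set becomes the pool."""
    rng = np.random.default_rng(seed)
    pos = rng.choice(np.where(y_train == 1)[0], n_init // 2, replace=False)
    neg = rng.choice(np.where(y_train == 0)[0], n_init // 2, replace=False)
    init = np.concatenate([pos, neg])
    pool = np.setdiff1d(np.arange(len(y_train)), init)
    return init, pool
```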

3.3 Baselines

We consider three acquisition functions, random, BALD, and EPIG (Sect. 2.1), and two learning paradigms, supervised active learning (AL) and semi-supervised active learning (SSAL). In SSAL, we use the BERT features pretrained on 1.6 million SMILES; in AL, we use Extended-Connectivity Fingerprints (ECFP) directly. ECFP is a cheminformatics method that represents molecular structures as binary fingerprints, encoding the presence or absence of substructural features within a specified radius around each atom. Through iterative traversal of the molecular structure, unique substructural fragments are identified and hashed into a fixed-length bit vector, so that each bit of the fingerprint indicates the presence or absence of a specific fragment. We encoded each molecule into a fixed 1024-dimensional binary vector using a radius of 6.
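With RDKit, this featurization can be written as the following sketch, using the classic Morgan-fingerprint API:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def ecfp(smiles, radius=6, n_bits=1024):
    """1024-bit Morgan/ECFP fingerprint with radius 6, as used here."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp, dtype=np.float32)
```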

4 Results and Discussions

We began by training separate neural networks for each task, starting with an initial set of 100 molecules. We then iteratively chose the next molecule according to an acquisition function (BALD, EPIG, or random) for 200 iterations, evaluating on the test set after each round. Our study compared active learning strategies using both ECFP and BERT features. We repeated this process with 5 different seeds and report the evolution of average precision (AUPR) across iterations (Fig. 1). Notably, active learning with pretrained BERT features outperformed models trained on ECFP. Additionally, the BALD and EPIG acquisition functions consistently selected more informative samples than uniform (random) sampling, with EPIG showing a slight edge over BALD.

Many learning algorithms struggle to learn effectively from imbalanced datasets, where the dominance of the majority class can overwhelm the learning process. As illustrated in Fig. 2, our analysis demonstrates that both EPIG and BALD consistently acquire a higher proportion of positive samples than random acquisition. This observation is particularly significant for modelling toxicity datasets.