Abstract
In drug discovery, prioritizing compounds for testing is an important task. Active learning can assist in this endeavor by prioritizing molecules for label acquisition based on their estimated potential to enhance in-silico models. However, in specialized cases like toxicity modeling, limited dataset sizes can hinder the effective training of modern neural networks for representation learning and active learning. In this study, we leverage a transformer-based BERT model pretrained on millions of SMILES to perform active learning. Additionally, we explore different acquisition functions to assess their compatibility with the pretrained BERT model. Our results demonstrate that pretrained models enhance active learning outcomes. Furthermore, we observe that active learning selects a higher proportion of positive compounds than random acquisition, an important advantage when dealing with imbalanced toxicity datasets. Through a comparative analysis, we find that both the BALD and EPIG acquisition functions outperform random acquisition, with EPIG exhibiting slightly superior performance over BALD. In summary, our study highlights the effectiveness of active learning in conjunction with pretrained models to tackle the problem of data scarcity.
1 Introduction and Background
Drug design is a complex process, with costs exceeding $4 billion and a decade of development time required to bring a new drug to market (Schlander et al., 2021). Despite this investment, the vast majority of candidate drugs never make it to clinical trials, and of those that do, a staggering 90% fail (Sun et al., 2022), with 50% of failures attributed to unexpected human toxicity (Van Norman, 2019). Traditional toxicological studies rely on animal models at the preclinical stage, yet these models face limitations in reliability, time, and ethical concerns, and their translational relevance to humans remains uncertain (Raies and Bajic, 2016).
The adoption of the 3R principles (Replace, Reduce, Refine) to curtail animal testing has catalyzed the development of in vitro methods for toxicological assessment of new compounds (Choudhuri et al., 2018). In the early phases of drug discovery, multiple cytotoxicity assays measure the impact of chemical compounds on cellular structure and function, providing early indications of potential tissue and organ toxicity (Ballantyne, 2006; Tabernilla et al., 2021).
Well-designed in vitro experiments can reduce the reliance on animal testing. An experiment is a systematic procedure aimed at collecting scientific data to test hypotheses or generate new ones. Common experimental designs include completely randomized experiments or randomized block testing (Festing, 2001).
In contexts such as high-throughput screening (HTS) and toxicity assays, exhaustive search is infeasible due to the vast number of possible combinations: it is simply not possible to test every drug against every target, so efficient experimental design is paramount (Niedz and Evens, 2016). Bayesian experimental design (BeD) emerges as a powerful tool in this regard, reducing the required number of experiments (Khan et al., 2023). BeD achieves this by proposing hypothetical experimental options based on the outcomes of previous ones, thereby potentially curtailing costs and expediting the drug discovery process (Daly et al., 2019; Bader et al., 2023).
In-silico methods are often used in conjunction with in-vitro studies to model the behaviour of biological systems by leveraging available experimental data (Merino-Casallo et al., 2018; Abd El Hafez et al., 2022). Bayesian methods have been applied to select the optimal parameters of in-vitro experiments (Pauwels et al., 2014; Johnston et al., 2016), to estimate the parameters of mechanistic models (Merino-Casallo et al., 2018; Demetriades et al., 2022), to estimate drug synergies (Cremaschi et al., 2019; Rønneberg et al., 2021), and to compute in-vitro dose-response curves (Hennessey et al., 2010). It remains unclear, however, which experiment to conduct next in order to obtain the most informative data point for inclusion in the subsequent iteration of training, so as to enhance the overall performance of these in-silico models. Addressing this challenge, we borrowed Bayesian methods for experimental design from the computer vision community and applied them to model toxicity endpoints.
2 Methods
2.1 Bayesian Active Learning
We first consider fully supervised learning tasks, e.g., estimating molecular properties, using a probabilistic model with likelihood function \(p(y|\boldsymbol{x}, \phi )\), where \(\boldsymbol{x}\) is an input, y is the output, and \(\phi \) is the parameter of the model \(f(\boldsymbol{x}; \phi )\) which has a prior distribution \(p(\phi )\) and a posterior \(p(\phi |\mathcal {D})\) given a labelled training set \(\mathcal {D}=\{(\boldsymbol{x}_i, y_i)\}_{i=1}^{N}\). In active learning or experimental design (Rainforth et al., 2024), we have access to another unlabelled set \(\mathcal {D}_u=\{(\boldsymbol{x}_i^u)\}_{i=1}^{N_u}\) and select which labels to acquire when training the model \(f(\boldsymbol{x}; \phi )\) by maximizing an acquisition function that captures the expected utility of acquiring the label \(y_s\) of the selected input \(\boldsymbol{x}_s\). Then the new labelled data \((\boldsymbol{x}_s^u, y_s)\) is incorporated into the training set \(\mathcal {D}=\mathcal {D}\bigcup \{(\boldsymbol{x}_s^u, y_s)\}\) and the probabilistic model, i.e., the posterior \(p(\phi |\mathcal {D})\), is updated accordingly.
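The acquisition loop just described can be sketched schematically. In this sketch the retraining and scoring steps are placeholders (`retrain` and `acquisition_scores` are hypothetical names, and the scores are random stand-ins for the acquisition functions introduced below); it illustrates only the data flow between \(\mathcal {D}\) and \(\mathcal {D}_u\), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a labelled set D and an unlabelled pool D_u (indices only).
labelled = list(range(10))
pool = list(range(10, 100))

def retrain(labelled):
    # Placeholder for updating the posterior p(phi | D); returns a "model".
    return {"n_train": len(labelled)}

def acquisition_scores(model, pool):
    # Placeholder scores (e.g. BALD or EPIG values); random here.
    return rng.random(len(pool))

# Several rounds of the active-learning loop described above.
for step in range(5):
    model = retrain(labelled)
    scores = acquisition_scores(model, pool)
    s = int(np.argmax(scores))      # x_s = argmax of the acquisition function
    x_s = pool.pop(s)               # query the label y_s of x_s ...
    labelled.append(x_s)            # ... and fold (x_s, y_s) into D
```

Each round moves exactly one pool point into the training set, after which the posterior is refreshed before the next acquisition.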
Acquisition Function: BALD One popular acquisition function is Bayesian Active Learning by Disagreement (BALD) (Houlsby et al., 2011), which is the expected information gain, measured by the reduction in Shannon entropy, of the model parameter \(\phi \) from labelling \(\boldsymbol{x}\), across all possible realisations of its label y given by \(p(y|\boldsymbol{x},\mathcal {D})\). Specifically, we have

$$\text {BALD}(\boldsymbol{x})=\mathrm {H}\left[ p(y|\boldsymbol{x},\mathcal {D})\right] -\mathbb {E}_{p(\phi |\mathcal {D})}\left[ \mathrm {H}\left[ p(y|\boldsymbol{x},\phi )\right] \right] $$
with the optimal design \(\boldsymbol{x}^{\star }=\mathop {\mathrm {arg\,max}}\limits _{\boldsymbol{x}}\text {BALD}(\boldsymbol{x})\). The first term in BALD measures the total uncertainty on \(\boldsymbol{x}\) while the second term measures its aleatoric uncertainty, i.e., the irreducible uncertainty from observational noise. Therefore, BALD selects \(\boldsymbol{x}\) with the highest epistemic uncertainty, i.e., the reducible uncertainty from the lack of data (Kendall and Gal, 2017).
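When the posterior is approximated by K Monte Carlo samples (e.g. MC-dropout forward passes, Sect. 2.2), the binary-classification BALD score reduces to the entropy of the mean prediction minus the mean per-sample entropy. Below is a minimal numpy sketch of this estimator, not the authors' exact code:

```python
import numpy as np

def binary_entropy(p):
    """Shannon entropy (nats) of a Bernoulli distribution, elementwise."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def bald(probs):
    """BALD scores for binary classification.

    probs: array of shape (K, N) -- p(y=1 | x_n, phi_k) for K posterior
    (MC-dropout) samples and N candidate inputs.
    """
    total = binary_entropy(probs.mean(axis=0))      # H[p(y|x,D)]
    aleatoric = binary_entropy(probs).mean(axis=0)  # E_phi H[p(y|x,phi)]
    return total - aleatoric                        # epistemic uncertainty

# An input on which all posterior samples agree scores ~0 (no epistemic
# uncertainty); an input on which they disagree scores high.
probs = np.array([[0.99, 0.9],
                  [0.99, 0.1]])   # K=2 samples, N=2 inputs
scores = bald(probs)
```

The first candidate (consistent 0.99 predictions) gets a score of zero even though its predictive entropy is nonzero, showing that BALD ignores purely aleatoric uncertainty.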
Acquisition Function: EPIG BALD targets a global uncertainty reduction on the parameter space \(\phi \). However, in most supervised learning tasks, users are interested in improving the model accuracy on a target distribution \(p(\boldsymbol{x}_*)\), e.g., the test set. Therefore, recent work (Smith et al., 2023a) argued that an acquisition function, Expected Predictive Information Gain (EPIG), which explicitly reduces the model output uncertainty on random samples from \(p(\boldsymbol{x}_*)\), is more effective than BALD in improving the model performance. Specifically, as discussed and defined in (Smith et al., 2023b),

$$\text {EPIG}(\boldsymbol{x})=\mathbb {E}_{p(\boldsymbol{x}_*)}\left[ \mathrm {H}\left[ p(y_*|\boldsymbol{x}_*,\mathcal {D})\right] -\mathbb {E}_{p(y|\boldsymbol{x},\mathcal {D})}\left[ \mathrm {H}\left[ p(y_*|\boldsymbol{x}_*,\mathcal {D}\cup \{(\boldsymbol{x},y)\})\right] \right] \right] $$

is the expected reduction of the “expected predictive uncertainty” over the target input distribution \(p(\boldsymbol{x}_*)\) from observing the label of \(\boldsymbol{x}\). Intuitively, compared with BALD, which reduces the parameter uncertainty globally, EPIG only reduces the parameter uncertainty that in turn reduces the model output uncertainty on \(p(\boldsymbol{x}_*)\).
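EPIG can likewise be estimated from K joint posterior samples: for each sample \(\phi _k\), the model predicts on both the candidate \(\boldsymbol{x}\) and target samples \(\boldsymbol{x}_*\), and the score is the mutual information between \(y\) and \(y_*\) under the joint predictive, averaged over targets. A minimal numpy sketch for the binary case, assuming the per-sample probabilities are already computed (an illustrative estimator, not the authors' exact implementation):

```python
import numpy as np

def epig(probs_pool, probs_target):
    """Monte Carlo EPIG estimate for binary classification.

    probs_pool:   (K, N) -- p(y=1 | x_n, phi_k) for N pool candidates.
    probs_target: (K, M) -- p(y_*=1 | x_*m, phi_k) for M target samples.
    Returns (N,) scores: mutual information I(y; y_*) averaged over targets.
    """
    K, _ = probs_pool.shape
    eps = 1e-12
    # Per-sample Bernoulli distributions as (K, N, 2) / (K, M, 2) arrays.
    p_pool = np.stack([1 - probs_pool, probs_pool], axis=-1)
    p_tgt = np.stack([1 - probs_target, probs_target], axis=-1)
    # Joint predictive p(y, y_* | x, x_*): average of outer products over phi.
    joint = np.einsum("kna,kmb->nmab", p_pool, p_tgt) / K
    marg_pool = joint.sum(axis=3)   # p(y | x),     shape (N, M, 2)
    marg_tgt = joint.sum(axis=2)    # p(y_* | x_*), shape (N, M, 2)
    prod = marg_pool[..., :, None] * marg_tgt[..., None, :]
    mi = (joint * (np.log(joint + eps) - np.log(prod + eps))).sum(axis=(2, 3))
    return mi.mean(axis=1)          # expectation over p(x_*)

# Candidate A's prediction covaries with the target across posterior samples,
# so labelling it is informative; candidate B's does not, so EPIG(B) ~ 0.
probs_target = np.array([[0.9], [0.1]])          # K=2 samples, M=1 target
probs_pool = np.array([[0.9, 0.5], [0.1, 0.5]])  # A correlated, B constant
scores = epig(probs_pool, probs_target)
```

Note that candidate B could still carry high BALD value under a different posterior; EPIG discounts it precisely because resolving its label would not change the predictions on the targets.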
Semi-supervised Active Learning (SSAL). In the fully supervised scenario, the model \(f(\boldsymbol{x};\phi )\) only learns from the labelled dataset \(\mathcal {D}\). This is inefficient in active learning because the labelled training set is initially small, and active learning must collect more data before it can learn a good input manifold, which is required to estimate the uncertainty of downstream tasks (Smith et al., 2024). This is particularly challenging in the chemical space, where the input manifold is nontrivial (Zhou et al., 2019). Therefore, researchers have proposed semi-supervised active learning (SSAL) approaches (Zhang et al., 2019; Hao et al., 2020) that learn the representations of input molecules from both labelled and unlabelled data and conduct active learning on the representation space with the labelled data. However, the available molecules in most public molecular property datasets are still limited (ranging from hundreds to thousands), even without labels.
In this paper, we propose to use molecular representations from a pretrained self-supervised learning model. Specifically, we encoded the molecular SMILES sequences into corresponding embeddings using MolBERT, a large transformer model pretrained on 1.6 million SMILES via masking, alongside physicochemical properties (Fabian et al., 2020). The embedding of each SMILES sequence is the pooled output of the pretrained MolBERT, with dimension 768. We used these embeddings to train a fully connected (i.e., MLP) head. This strategy allowed us to leverage a significant volume of molecule data, offering particular benefits for conducting active learning on relatively small datasets.
2.2 Practical Bayesian Neural Networks
In this work, we use a Bayesian neural network to account for the model uncertainty. According to recent research on dropout variational inference (Gal and Ghahramani, 2016), a practical Bayesian neural network can be obtained for a wide variety of architectures by simply training a neural network with dropout (MC dropout) and interpreting this as being equivalent to variational inference (Blei et al., 2017). The uncertainty is then estimated by performing multiple forward passes with different dropout masks. Although the uncertainty from MC dropout is often underestimated, it has been a popular choice for Bayesian active learning with neural networks and shows promise on real-world datasets (Gal et al., 2017; Rakesh and Jain, 2021).
This neural network architecture consists of input, hidden, and output layers, where \(\boldsymbol{x}_0\) is initialized as the input features \(\boldsymbol{x}\), which can be either BERT features (in semi-supervised AL) or binary fingerprints (in supervised AL). We utilize dropout for uncertainty estimation, batch normalization for training stability, and the rectified linear unit (ReLU) as the default activation function. Additionally, the network incorporates a skip connection that merges the input and output of the hidden layer, enhancing information flow. Finally, the output layer generates logits, which are transformed into probabilities by a sigmoid activation function.
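The forward pass of this head under MC dropout can be sketched in numpy as follows. Batch normalization is omitted for brevity, and all weights are random placeholders rather than a trained model; the point is only how fresh dropout masks on the hidden layer, the skip connection, and repeated forward passes yield a predictive mean and uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(x, W_in, W_hid, w_out, p_drop=0.2, n_samples=100):
    """Predictive mean and std from stochastic forward passes.

    Sketch of the head described above (batch norm omitted): input layer
    -> ReLU -> dropout on the hidden layer -> skip connection adding the
    hidden layer's input to its output -> logit -> sigmoid.
    """
    h_in = np.maximum(x @ W_in, 0.0)                 # input layer + ReLU
    preds = []
    for _ in range(n_samples):
        mask = rng.random(h_in.shape) > p_drop       # fresh dropout mask
        h = np.maximum((h_in * mask / (1 - p_drop)) @ W_hid, 0.0)
        h = h + h_in                                 # skip connection
        logit = h @ w_out
        preds.append(1.0 / (1.0 + np.exp(-logit)))   # sigmoid -> probability
    preds = np.array(preds)                          # (n_samples, batch)
    return preds.mean(axis=0), preds.std(axis=0)

d_in, d_h = 8, 16
x = rng.standard_normal((4, d_in))
mean, std = mc_dropout_predict(
    x,
    rng.standard_normal((d_in, d_h)),
    rng.standard_normal((d_h, d_h)),
    rng.standard_normal(d_h))
```

The per-input standard deviation across dropout samples is what feeds the acquisition functions of Sect. 2.1 (via the per-sample probabilities, not just the std).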
The hyper-parameters of this model are given in Table 1.
3 Experiments
3.1 Dataset
Tox21. The Tox21 (Toxicology in the 21st Century) dataset is a publicly available dataset used in the field of computational toxicology (Richard et al., 2021). It consists of a large collection of chemical compounds, each associated with various toxicity outcomes, typically measured using high-throughput screening assays that evaluate the compounds' potential toxic effects. The dataset provides a quantitative assessment (in the form of binary labels) of the toxicity of \(\approx \) 8000 compounds in 12 different toxicity pathways.
The Tox21 dataset is widely used as a benchmark in the development of in silico toxicology models. In this dataset, 6.24% of the measurements are active (ranging from 2% to 12% across the 12 tasks), 73% are inactive, and 20.56% are missing values, as shown in Fig. 3.
3.2 Data Splitting
Train and Test Sets. For a better evaluation of generalization, we employed scaffold splitting with an 80:20 ratio to create distinct training and testing sets. Scaffold splitting partitions a molecular dataset according to core structural motifs identified by the Bemis-Murcko scaffold representation (Bemis and Murcko, 1996), prioritizing larger groups while ensuring that the train and test sets do not share identical scaffolds. The test set is identical across all experiments.
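This splitting scheme can be sketched as follows, assuming the Bemis-Murcko scaffold strings have already been computed for each molecule (e.g. with RDKit, not shown here); the greedy largest-group-first assignment is one common way to realise the "prioritizing larger groups" behaviour described above.

```python
from collections import defaultdict

def scaffold_split(scaffolds, test_frac=0.2):
    """Group molecule indices by scaffold, fill the training set with the
    largest groups first, and route the remainder to the test set, so no
    scaffold is shared between the two splits.

    scaffolds: list of scaffold strings, one per molecule (e.g. Bemis-Murcko
    scaffold SMILES computed beforehand).
    """
    groups = defaultdict(list)
    for i, s in enumerate(scaffolds):
        groups[s].append(i)
    train, test = [], []
    n_train_target = int((1 - test_frac) * len(scaffolds))
    # Largest scaffold groups first.
    for _, idx in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        if len(train) + len(idx) <= n_train_target:
            train.extend(idx)
        else:
            test.extend(idx)
    return train, test

# Ten molecules over three scaffolds: the whole of each scaffold group lands
# on one side of the split.
scaffolds = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
train, test = scaffold_split(scaffolds, test_frac=0.2)
```

Because whole scaffold groups are assigned as units, the realised split ratio is only approximately 80:20 when group sizes do not divide evenly.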
Initial and Pool Sets. A balanced initial set was constructed by randomly selecting 100 molecules from the training set, with equal representation of positive and negative instances. Subsequently, a pool set was generated by excluding the initial set from the training set.
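The construction of the initial and pool sets can be sketched with a small helper (`balanced_initial_set` is a hypothetical name, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def balanced_initial_set(labels, n_init=100):
    """Sample n_init molecules with equal positive/negative representation;
    all remaining training molecules become the pool set."""
    labels = np.asarray(labels)
    pos = rng.permutation(np.flatnonzero(labels == 1))[: n_init // 2]
    neg = rng.permutation(np.flatnonzero(labels == 0))[: n_init // 2]
    initial = np.concatenate([pos, neg])
    pool = np.setdiff1d(np.arange(len(labels)), initial)
    return initial, pool

# Toy training labels with the kind of imbalance seen in Tox21.
labels = np.array([1] * 200 + [0] * 800)
initial, pool = balanced_initial_set(labels, n_init=100)
```

Starting from a balanced seed set avoids the degenerate case where the first model sees almost no positives at all.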
3.3 Baselines
We consider three acquisition functions, random, BALD, and EPIG (Sect. 2.1), and two learning paradigms, supervised active learning (AL) and semi-supervised active learning (SSAL). In SSAL, we use the BERT features pretrained on 1.6 million SMILES; in AL, we use Extended-Connectivity Fingerprints (ECFP) directly. ECFP is a cheminformatics method that represents molecular structures as binary fingerprints, capturing structural information by encoding the presence or absence of substructural features within a specified radius around each atom. Through iterative traversal of the molecular structure, unique substructural fragments are identified and hashed into a fixed-length bit vector, where each bit indicates the presence or absence of a specific fragment. We encoded each molecule into a fixed 1024-dimensional binary vector using a radius of 6.
4 Results and Discussions
We began by training separate neural networks for each task, starting with an initial set of 100 molecules. We then iteratively chose the next molecule based on an acquisition function (BALD, EPIG, or random) for 200 iterations, evaluating on the test set after each round. Our study compared active learning strategies using both ECFP and BERT features. We repeated this process with 5 different seeds and show the evolution of average precision (AUPR) across iterations (Fig. 1). Notably, active learning with pretrained BERT features outperformed models trained on ECFP. Additionally, the BALD and EPIG acquisition functions consistently selected more informative samples than uniform (random) sampling, with EPIG showing a slight edge over BALD. Many learning algorithms face challenges in learning effectively from imbalanced datasets, where the dominance of the majority class can overwhelm the learning process. As illustrated in Fig. 2, our analysis demonstrates that both EPIG and BALD consistently acquire a higher proportion of positive samples than random acquisition. This observation holds particular significance for modeling toxicity datasets.
References
Abd El Hafez, M.S., et al.: Characterization, in-silico, and in-vitro study of a new steroid derivative from Ophiocoma dentata as a potential treatment for COVID-19. Sci. Rep. 12(1), 5846 (2022). ISSN 2045-2322. https://doi.org/10.1038/s41598-022-09809-2, https://www.nature.com/articles/s41598-022-09809-2. Publisher: Nature Publishing Group
Bader, J., Narayanan, H., Arosio, P., Leroux, J.C.: Improving extracellular vesicles production through a Bayesian optimization-based experimental design. Eur. J. Pharm. Biopharm. 182, 103–114 (2023). ISSN 0939-6411. https://doi.org/10.1016/j.ejpb.2022.12.004, https://www.sciencedirect.com/science/article/pii/S0939641122002983
Ballantyne, B.: Local and systemic ophthalmic pharmacology and toxicology of organophosphate and carbamate anticholinesterases. In: Toxicology of Organophosphate & Carbamate Compounds, pp. 423–445. Elsevier (2006). ISBN 978-0-12-088523-7. https://doi.org/10.1016/B978-012088523-7/50032-6, https://linkinghub.elsevier.com/retrieve/pii/B9780120885237500326
Bemis, G.W., Murcko, M.A.: The properties of known drugs. 1. molecular frameworks. J. Med. Chem. 39(15), 2887–2893 (1996). ISSN 0022-2623. https://doi.org/10.1021/jm9602928. Publisher: American Chemical Society
Blei, D.M., Kucukelbir, A., McAuliffe, J.D.: Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112(518), 859–877 (2017)
Choudhuri, S., Patton, G.W., Chanderbhan, R.F., Mattia, A., Klaassen, C.D.: From classical toxicology to Tox21: some critical conceptual and technological advances in the molecular understanding of the toxic response beginning from the last quarter of the 20th century. Toxicol. Sci. 161(1), 5–22 (2018). ISSN 1096-6080, 1096-0929. https://doi.org/10.1093/toxsci/kfx186, https://academic.oup.com/toxsci/article/161/1/5/4102075
Cremaschi, A., Frigessi, A., Taskén, K., Zucknick, M.: A Bayesian approach to study synergistic interaction effects in in-vitro drug combination experiments. arXiv:1904.04901 (2019)
Daly, A.J., Stock, M., Baetens, J.M., De Baets, B.: Guiding mineralization co-culture discovery using bayesian optimization. Environ. Sci. Technol. 53(24), 14459–14469 (2019). ISSN 0013-936X. https://doi.org/10.1021/acs.est.9b05942. Publisher: American Chemical Society
Demetriades, M., et al.: Interrogating and quantifying in vitro cancer drug pharmacodynamics via agent-based and bayesian monte carlo modelling. Pharmaceutics 14(4), 749 (2022). ISSN 1999-4923. https://doi.org/10.3390/pharmaceutics14040749, https://www.mdpi.com/1999-4923/14/4/749. Number: 4 Publisher: Multidisciplinary Digital Publishing Institute
Fabian, B., Edlich, T., Gaspar, H., Segler, M., Meyers, J., Fiscato, M.: Molecular representation learning with language models and domain-relevant auxiliary tasks. arXiv:2011.13230 (2020)
Festing, M.F.: Guidelines for the design and statistical analysis of experiments in papers submitted to ATLA. Alternatives to Laboratory Animals (2001). https://doi.org/10.1177/026119290102900409, https://journals.sagepub.com/doi/10.1177/026119290102900409. Publisher: SAGE PublicationsSage UK: London, England
Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In: Proceedings of the 33rd International Conference on Machine Learning, pp. 1050–1059 (2016)
Gal, Y., Islam, R., Ghahramani, Z.: Deep Bayesian active learning with image data. In: International Conference on Machine Learning, pp. 1183–1192. PMLR (2017)
Hao, Z., et al.: ASGN: an active semi-supervised graph neural network for molecular property prediction. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 731–752 (2020)
Hennessey, V.G., Rosner, G.L., Bast Jr, R.C., Chen, M.Y.: A Bayesian approach to dose-response assessment and synergy and its application to in vitro dose-response studies. Biometrics, 66(4), 1275–1283 (2010). ISSN 0006-341X. https://doi.org/10.1111/j.1541-0420.2010.01403.x
Houlsby, N., Huszár, F., Ghahramani, Z., Lengyel, M.: Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745 (2011)
Johnston, S.T., Ross, J.V., Binder, B.J., McElwain, D.S., Haridas, P., Simpson, M.J.: Quantifying the effect of experimental design choices for in vitro scratch assays. J. Theor. Biol. 400, 19–31 (2016). ISSN 0022-5193. https://doi.org/10.1016/j.jtbi.2016.04.012, https://www.sciencedirect.com/science/article/pii/S0022519316300406
Kendall, A., Gal, Y.: What uncertainties do we need in Bayesian deep learning for computer vision? In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Khan, A., et al.: Toward real-world automated antibody design with combinatorial Bayesian optimization. Cell Rep. Methods 3(1) (2023). ISSN 2667-2375. https://doi.org/10.1016/j.crmeth.2022.100374, https://www.cell.com/cell-reports-methods/abstract/S2667-2375(22)00276-4. Publisher: Elsevier
Merino-Casallo, F., Gomez-Benito, M.J., Juste-Lanas, Y., Martinez-Cantin, R., Garcia-Aznar, J.M.: Integration of in vitro and in silico models using bayesian optimization with an application to stochastic modeling of mesenchymal 3D cell migration. Front. Phys. 9, 1246 (2018). ISSN 1664-042X. https://doi.org/10.3389/fphys.2018.01246, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6142046/
Niedz, R.P., Evens, T.J.: Design of Experiments (DOE)—history, concepts, and relevance to in vitro culture. Vitro Cell. Dev. Biol. Plant 52(6), 547–562 (2016). https://doi.org/10.1007/s11627-016-9786-1
Pauwels, E., Lajaunie, C., Vert, J.P.: A Bayesian active learning strategy for sequential experimental design in systems biology. BMC Syst. Biol. 8(1), 102 (2014). ISSN 1752-0509. https://doi.org/10.1186/s12918-014-0102-6, https://bmcsystbiol.biomedcentral.com/articles/10.1186/s12918-014-0102-6
Raies, A.B., Bajic, V.B.: In silico toxicology: computational methods for the prediction of chemical toxicity: computational methods for the prediction of chemical toxicity. Wiley Interdiscip. Rev. Comput. Mol. Sci. 6(2), 147–172 (2016). ISSN 17590876. https://doi.org/10.1002/wcms.1240, https://onlinelibrary.wiley.com/doi/10.1002/wcms.1240
Rainforth, T., Foster, A., Ivanova, D.R., Bickford Smith, F.: Modern Bayesian experimental design. Stat. Sci. 39(1), 100–114 (2024)
Rakesh, V., Jain, S.: Efficacy of Bayesian neural networks in active learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2601–2609 (2021)
Richard, A.M., et al.: The Tox21 10K compound library: collaborative chemistry advancing toxicology. Chem. Res. Toxicol. 34(2), 189–216 (2021). ISSN 0893-228X, 1520-5010. https://doi.org/10.1021/acs.chemrestox.0c00264, https://pubs.acs.org/doi/10.1021/acs.chemrestox.0c00264
Rønneberg, L., Cremaschi, A., Hanes, R., Enserink, J.M., Zucknick, M.: Bayesynergy: flexible Bayesian modelling of synergistic interaction effects in in vitro drug combination experiments. Briefings Bioinform. 22(6), bbab251 (2021). ISSN 1477-4054. https://doi.org/10.1093/bib/bbab251
Schlander, M., Hernandez-Villafuerte, K., Cheng, C.-Y., Mestre-Ferrandiz, J., Baumann, M.: How much does it cost to research and develop a new drug? a systematic review and assessment. Pharmacoeconomics 39(11), 1243–1269 (2021). https://doi.org/10.1007/s40273-021-01065-y
Smith, F.B., Kirsch, A., Farquhar, S., Gal, Y., Foster, A., Rainforth, T.: Prediction-oriented Bayesian active learning. In: International Conference on Artificial Intelligence and Statistics, pp. 7331–7348. PMLR (2023)
Smith, F.B., Kirsch, A., Farquhar, S., Gal, Y., Foster, A., Rainforth, T.: Prediction-oriented Bayesian active learning. arXiv:2304.08151v1 (2023)
Smith, F.B., Foster, A., Rainforth, T.: Making better use of unlabelled data in Bayesian active learning. In: International Conference on Artificial Intelligence and Statistics, pp. 847–855. PMLR (2024)
Sun, D., Gao, W., Hu, H., Zhou, S.: Why 90% of clinical drug development fails and how to improve it? Acta Pharmaceutica Sinica B 12(7), 3049–3062 (2022). ISSN 2211-3835. https://doi.org/10.1016/j.apsb.2022.02.002, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9293739/
Tabernilla, A., et al.: In vitro liver toxicity testing of chemicals: a pragmatic approach. Int. J. Mol. Sci. 22(9), 5038 (2021). ISSN 1422-0067. https://doi.org/10.3390/ijms22095038, https://www.mdpi.com/1422-0067/22/9/5038
Van Norman, G.A.: Phase II trials in drug development and adaptive trial design. JACC: Basic Transl. Sci. 4(3), 428–437 (2019). ISSN 2452302X. https://doi.org/10.1016/j.jacbts.2019.02.005, https://linkinghub.elsevier.com/retrieve/pii/S2452302X19300658
Zhang, Y., et al.: Bayesian semi-supervised learning for uncertainty-calibrated prediction of molecular properties and active learning. Chem. Sci. 10(35), 8154–8163 (2019)
Zhou, Z., Kearnes, S., Li, L., Zare, R.N., Riley, P.: Optimization of molecules via deep reinforcement learning. Sci. Rep. 9(1), 10752 (2019)
Acknowledgments
The authors acknowledge financial support from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 956832, “Advanced Machine learning for Innovative Drug Discovery” (AIDD).
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2025 The Author(s)
Cite this paper
Masood, M.A., Cui, T., Kaski, S. (2025). Deep Bayesian Experimental Design for Drug Discovery. In: Clevert, DA., Wand, M., Malinovská, K., Schmidhuber, J., Tetko, I.V. (eds) AI in Drug Discovery. AIDD 2024. Lecture Notes in Computer Science, vol 14894. Springer, Cham. https://doi.org/10.1007/978-3-031-72381-0_12
Print ISBN: 978-3-031-72380-3
Online ISBN: 978-3-031-72381-0