Abstract
We present a statistical simulator, scDesign3, to generate realistic single-cell and spatial omics data, including various cell states, experimental designs and feature modalities, by learning interpretable parameters from real data. Using a unified probabilistic model for single-cell and spatial omics data, scDesign3 infers biologically meaningful parameters; assesses the goodness-of-fit of inferred cell clusters, trajectories and spatial locations; and generates in silico negative and positive controls for benchmarking computational tools.
Data availability
All datasets used in the study are publicly available. Supplementary Table 2 lists the datasets from 17 published studies (sources included). The preprocessed datasets are available in the Zenodo repository at https://doi.org/10.5281/zenodo.7110761 (ref. 52).
Code availability
The scDesign3 package is available at https://github.com/SONGDONGYUAN1994/scDesign3. The comprehensive tutorials are available at https://songdongyuan1994.github.io/scDesign3/docs/index.html. In the tutorials, we described the input and output formats, model parameters and exemplary datasets for each functionality of scDesign3. The source code for reproducing the results is available in the Zenodo repository at https://doi.org/10.5281/zenodo.7110761 (ref. 52).
References
Tang, F. et al. mRNA-seq whole-transcriptome analysis of a single cell. Nat. Methods 6, 377–382 (2009).
Cao, J. et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566, 496–502 (2019).
Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).
Cusanovich, D. A. et al. Multiplex single-cell profiling of chromatin accessibility by combinatorial cellular indexing. Science 348, 910–914 (2015).
Karemaker, I. D. & Vermeulen, M. Single-cell DNA methylation profiling: technologies and biological applications. Trends Biotechnol. 36, 952–965 (2018).
Bendall, S. C. et al. Single-cell mass cytometry of differential immune and drug responses across a human hematopoietic continuum. Science 332, 687–696 (2011).
Chen, S., Lake, B. B. & Zhang, K. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 37, 1452–1457 (2019).
Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).
Rao, N., Clark, S. & Habern, O. Bridging genomics and tissue pathology: 10x genomics explores new frontiers with the visium spatial gene expression solution. Genet. Eng. Biotechnol. News 40, 50–51 (2020).
Rodriques, S. G. et al. Slide-seq: a scalable technology for measuring genome-wide expression at high spatial resolution. Science 363, 1463–1467 (2019).
Stickels, R. R. et al. Highly sensitive spatial transcriptomics at near-cellular resolution with Slide-seqV2. Nat. Biotechnol. 39, 313–319 (2021).
Moffitt, J. R. et al. Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Science 362, eaau5324 (2018).
Efremova, M. & Teichmann, S. A. Computational methods for single-cell omics across modalities. Nat. Methods 17, 14–17 (2020).
Cao, Y., Yang, P. & Yang, J. Y. H. A benchmark study of simulation methods for single-cell RNA sequencing data. Nat. Commun. 12, 6911 (2021).
Crowell, H. L., Morillo Leonardo, S. X., Soneson, C. & Robinson, M. D. The shaky foundations of simulating single-cell RNA sequencing data. Genome Biol. 24, 62 (2023).
Sun, T., Song, D., Li, W. V. & Li, J. J. scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured. Genome Biol. 22, 163 (2021).
Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J.-P. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. 9, 284 (2018).
Crowell, H. L. et al. Muscat detects subpopulation-specific state transitions from multi-sample multi-condition single-cell transcriptomics data. Nat. Commun. 11, 6077 (2020).
Cannoodt, R., Saelens, W., Deconinck, L. & Saeys, Y. Spearheading future omics analyses using dyngen, a multi-modal simulator of single cells. Nat. Commun. 12, 3942 (2021).
Dibaeinia, P. & Sinha, S. Sergio: a single-cell expression simulator guided by gene regulatory networks. Cell Syst. 11, 252–271 (2020).
Papadopoulos, N., Gonzalo, P. R. & Söding, J. Prosstt: probabilistic simulation of single-cell RNA-seq data for complex differentiation processes. Bioinformatics 35, 3517–3519 (2019).
Tian, J., Wang, J. & Roeder, K. Esco: single cell expression simulation incorporating gene co-expression. Bioinformatics 37, 2374–2381 (2021).
Navidi, Z., Zhang, L. & Wang, B. simATAC: a single-cell ATAC-seq simulation framework. Genome Biol. 22, 74 (2021).
Li, W. V. & Li, J. J. A statistical simulator scDesign for rational scRNA-seq experimental design. Bioinformatics 35, i41–i50 (2019).
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
Marouf, M. et al. Realistic in silico generation and augmentation of single-cell RNA-seq data using generative adversarial networks. Nat. Commun. 11, 166 (2020).
Ma, Y. & Zhou, X. Spatially informed cell-type deconvolution for spatial transcriptomics. Nat. Biotechnol. 40, 1349–1359 (2022).
Cable, D. M. et al. Robust decomposition of cell type mixtures in spatial transcriptomics. Nat. Biotechnol. 40, 517–526 (2022).
Elosua-Bayes, M., Nieto, P., Mereu, E., Gut, I. & Heyn, H. SPOTlight: seeded NMF regression to deconvolute spatial transcriptomics spots with single-cell transcriptomes. Nucleic Acids Res. 49, e50 (2021).
Yan, G. & Li, J. J. scReadSim: a single-cell multi-omics read simulator. Preprint at bioRxiv https://doi.org/10.1101/2022.05.29.493924 (2022).
Cao, K., Hong, Y. & Wan, L. Manifold alignment for heterogeneous single-cell multi-omics data integration using Pamona. Bioinformatics 38, 211–219 (2022).
Argelaguet, R., Cuomo, A. S. E., Stegle, O. & Marioni, J. C. Computational principles and challenges in single-cell data integration. Nat. Biotechnol. 39, 1202–1215 (2021).
Fang, J. et al. Clustering deviation index (CDI): a robust and accurate internal measure for evaluating scRNA-seq data clustering. Genome Biol. 23, 269 (2022).
Duò, A., Robinson, M. D. & Soneson, C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Res. 7, 1441 (2018).
Street, K. et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics 19, 477 (2018).
Ji, Z. & Ji, H. TSCAN: pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res. 44, e117 (2016).
Stasinopoulos, D. M. & Rigby, R. A. Generalized additive models for location scale and shape (GAMLSS) in R. J. Stat. Softw. 23, 1–46 (2008).
Zhang, Y., Parmigiani, G. & Johnson, W. E. ComBat-seq: batch effect adjustment for RNA-seq count data. NAR Genom. Bioinform. 2, lqaa078 (2020).
Wood, S. N. Generalized Additive Models: An Introduction with R (Chapman and Hall/CRC, 2006).
Kammann, E. E. & Wand, M. P. Geoadditive models. J. R. Stat. Soc. C 52, 1–18 (2003).
Czado, C. Analyzing Dependent Data with Vine Copulas (Springer, 2019).
Lun, A. T. L., McCarthy, D. J. & Marioni, J. C. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 5, 2122 (2016).
Stuart, T., Srivastava, A., Madad, S., Lareau, C. A. & Satija, R. Single-cell chromatin state analysis with Signac. Nat. Methods 18, 1333–1341 (2021).
Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587 (2021).
Zhu, J., Sun, S. & Zhou, X. SPARK-X: non-parametric modeling enables scalable and robust detection of spatial expression patterns for large spatial transcriptomic studies. Genome Biol. 22, 184 (2021).
Li, B. et al. Benchmarking spatial and single-cell transcriptomics integration methods for transcript distribution prediction and cell type deconvolution. Nat. Methods 19, 662–670 (2022).
Lütge, A. et al. CellMixS: quantifying and visualizing batch effects in single-cell RNA-seq data. Life Sci. Alliance 4, e202001004 (2021).
Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods 12, 453–457 (2015).
Zeng, D. et al. IOBR: multi-omics immuno-oncology biological research to decode tumor microenvironment and signatures. Front. Immunol. 12, 687975 (2021).
Biancalani, T. et al. Deep learning and alignment of spatially resolved single-cell transcriptomes with Tangram. Nat. Methods 18, 1352–1362 (2021).
Moriel, N. et al. Novosparc: flexible spatial reconstruction of single-cell gene expression with optimal transport. Nat. Protoc. 16, 4177–4200 (2021).
Song, D., Wang, Q. & Li, J. J. scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics. Zenodo https://doi.org/10.5281/zenodo.7110761 (2022).
Acknowledgements
We appreciate the comments and feedback from the members of the Junction of Statistics and Biology at UCLA (http://jsb.ucla.edu). This work was supported by the following grants: National Science Foundation grants no. DBI-1846216 and no. DMS-2113754, NIH/NIGMS grants no. R01GM120507 and no. R35GM140888, Johnson & Johnson WiSTEM2D Award, the Sloan Research Fellowship, the UCLA David Geffen School of Medicine W. M. Keck Foundation Junior Faculty Award and the Chan-Zuckerberg Initiative Single-Cell Biology Data Insights Grant (to J.J.L.). J.J.L. was a fellow at the Radcliffe Institute for Advanced Study at Harvard University in 2022–2023 while she was writing this paper.
Author information
Contributions
D.S. and J.J.L. conceived of the study. D.S., Q.W. and J.J.L. wrote the paper. D.S. and Q.W. developed the scDesign3 R package. D.S. and Q.W. performed data analysis with assistance from G.Y. and T.L. D.S. and T.S. discussed the scDesign3 method design at the beginning of the study.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Biotechnology thanks Kin Fai Au and Jean Yang for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Benchmarking scDesign3 against four existing scRNA-seq simulators (scGAN, muscat, SPARSim, and ZINB-WaVE) for generating scRNA-seq data from a single trajectory (mouse pancreatic endocrinogenesis; dataset PANCREAS in Supplementary Table 2).
a, Distributions of eight summary statistics in the test data and the synthetic data generated by scDesign3 and the four simulators. Each number on top of a violin plot (the distribution of a summary statistic in a synthetic dataset) is the Kolmogorov-Smirnov (KS) distance between the synthetic data distribution (indicated by that violin plot) and the test data distribution. A smaller number indicates better agreement between the synthetic data and the test data in terms of that summary statistic’s distribution. b, Heatmaps of the gene-gene correlation matrices (showing top 100 highly expressed genes) in the test data and the synthetic data generated by scDesign3 and the four simulators. Pearson’s correlation coefficient r measures the similarity between two correlation matrices, one from the test data and the other from the synthetic data. c, PCA visualization (top two PCs) of the test data and the synthetic data generated by scDesign3 and the four simulators. Colors label cells’ pseudotime values; note that only the synthetic data generated by scDesign3 contain the pseudotime truths. An mLISI value close to 2 means that the synthetic data resemble the real data well in the low-dimensional space. d, UMAP visualization of the real data and the synthetic data generated by scDesign3 and the four simulators.
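The KS distance reported on top of each violin plot can be illustrated with a minimal sketch. It computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical cumulative distribution functions) using only the Python standard library. The function name and the per-gene values below are hypothetical and invented for illustration; they are not taken from the study or from the scDesign3 package.

```python
from bisect import bisect_right


def ks_distance(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so the maximum
    # gap is attained at one of them.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Hypothetical values of one summary statistic (for example, per-gene mean
# expression) in the test data and in two synthetic datasets.
test_gene_means = [0.1, 0.4, 0.5, 1.2, 2.0, 3.1]
synthetic_gene_means = [0.2, 0.3, 0.6, 1.1, 2.2, 2.9]
shifted_gene_means = [value + 5.0 for value in test_gene_means]

print("close match:", ks_distance(test_gene_means, synthetic_gene_means))
print("poor match:", ks_distance(test_gene_means, shifted_gene_means))
```

A value near 0 means the two distributions nearly coincide, and a value of 1 means they do not overlap at all, which is why smaller numbers indicate better agreement.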
Extended Data Fig. 2 Benchmarking scDesign3 against four existing scRNA-seq simulators (scGAN, muscat, SPARSim, and ZINB-WaVE) for generating scRNA-seq data from bifurcating trajectories (myeloid progenitors in mouse bone marrow; dataset MARROW in Supplementary Table 2).
a, Distributions of eight summary statistics in the test data and the synthetic data generated by scDesign3 and the four simulators. Each number on top of a violin plot (the distribution of a summary statistic in a synthetic dataset) is the Kolmogorov-Smirnov (KS) distance between the synthetic data distribution (indicated by that violin plot) and the test data distribution. A smaller number indicates better agreement between the synthetic data and the test data in terms of that summary statistic’s distribution. b, Heatmaps of the gene-gene correlation matrices (showing top 100 highly expressed genes) in the test data and the synthetic data generated by scDesign3 and the four simulators. Pearson’s correlation coefficient r measures the similarity between two correlation matrices, one from the test data and the other from the synthetic data. c, PCA visualization (top two PCs) of the test data and the synthetic data generated by scDesign3 and the four simulators. Colors label cells’ pseudotime values in two trajectories; note that only the synthetic data generated by scDesign3 contain the pseudotime truths. An mLISI value close to 2 means that the synthetic data resemble the real data well in the low-dimensional space. d, UMAP visualization of the real data and the synthetic data generated by scDesign3 and the four simulators.
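The similarity score in panel b, a Pearson correlation between two gene-gene correlation matrices, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the legend does not say which matrix entries enter the comparison, so this sketch assumes the off-diagonal upper triangle. All function names and the small cell-by-gene matrices are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(cells_by_genes):
    """Gene-gene correlation matrix from a cell-by-gene expression matrix."""
    genes = list(zip(*cells_by_genes))  # transpose: one tuple per gene
    return [[pearson(gi, gj) for gj in genes] for gi in genes]


def upper_triangle(matrix):
    """Off-diagonal upper-triangle entries, row by row (an assumption)."""
    size = len(matrix)
    return [matrix[i][j] for i in range(size) for j in range(i + 1, size)]


def matrix_similarity(data_a, data_b):
    """Pearson r between the gene-gene correlation matrices of two datasets."""
    return pearson(
        upper_triangle(correlation_matrix(data_a)),
        upper_triangle(correlation_matrix(data_b)),
    )


# Hypothetical expression matrices: 5 cells (rows) by 3 genes (columns).
test_data = [[1, 2, 9], [2, 4, 7], [3, 5, 8], [4, 9, 2], [5, 10, 1]]
synthetic_data = [[1, 3, 8], [2, 3, 9], [3, 6, 6], [4, 8, 3], [5, 9, 2]]

similarity = matrix_similarity(test_data, synthetic_data)
print("r between correlation matrices:", similarity)
```

An r close to 1 indicates that the synthetic data preserve the pairwise gene correlation structure of the test data.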
Extended Data Fig. 3 scDesign3 simulated realistic gene expression patterns in cancer spatial transcriptomics data (datasets OVARIAN and ACINAR in Supplementary Table 2).
Human ovarian cancer (a) and human prostate cancer, acinar cell carcinoma (b). The tissue samples were measured with both H&E (hematoxylin and eosin stain, left) and spatial transcriptomics (right, three cancer-related genes). Large Pearson correlation coefficients (r) represent similar spatial patterns in synthetic data and real (test) data.
Extended Data Fig. 4 scDesign3 simulated 10x Visium spatial transcriptomics data (sagittal mouse brain slices; dataset VISIUM in Supplementary Table 2).
a, Distributions of eight summary statistics in the test data and the synthetic data generated by scDesign3 using cell type labels (scDesign3-ideal) and spatial locations (scDesign3-spatial), respectively. Each number on top of a violin plot (the distribution of a summary statistic in a synthetic dataset) is the Kolmogorov-Smirnov (KS) distance between the synthetic data distribution (indicated by that violin plot) and the test data distribution. A smaller number indicates better agreement between the synthetic data and the test data in terms of that summary statistic’s distribution. b, Heatmaps of the gene-gene correlation matrices (showing top 100 highly expressed genes) in the test data and the synthetic data generated by scDesign3-ideal and scDesign3-spatial. Pearson’s correlation coefficient r measures the similarity between two correlation matrices, one from the test data and the other from the synthetic data. c, PCA visualization (top two PCs) of the real data and the synthetic data generated by scDesign3-ideal and scDesign3-spatial. Cell types (clusters) are labeled by colors. Since the scDesign3-spatial dataset was based on spatial locations only, it did not contain cell types. An mLISI value close to 2 means that the synthetic data resemble the real data well in the low-dimensional space. d, UMAP visualization of the real data and the synthetic data generated by scDesign3-ideal and scDesign3-spatial. In summary, scDesign3 realistically simulated 10x Visium data based on spatial locations without needing cell type annotations.
Extended Data Fig. 5 scDesign3 mimicked spatial transcriptomics data so that prediction algorithms had similar prediction performance when trained on real data or scDesign3 synthetic data.
In detail, we first split each of four spatial transcriptomics datasets (VISIUM, SLIDE, OVARIAN, and ACINAR in Supplementary Table 2) into two datasets (training and testing) by randomly splitting the spatial locations into two halves. Second, we used each of the four training datasets to fit scDesign3 and generate the corresponding synthetic dataset. Third, on each pair of training dataset and synthetic dataset (among a total of four pairs), we trained each of three prediction algorithms (gbm: gradient boosting machine; randomForest: random forest; svmRadial: support vector machine with the radial kernel) to predict each gene’s expression at a spatial location (input: spatial location; output: the gene’s log(count+1) expression level at the location), obtaining a pair of prediction models for each gene. Fourth, we applied each pair of prediction models to the corresponding testing dataset and calculated each model’s root-mean-squared error (RMSE) for predicting the corresponding gene, obtaining a pair of RMSEs. As a result, in each panel, we plotted the RMSEs for each prediction algorithm (row) and dataset (column), with each dot in the panel representing a gene. We found all genes’ RMSEs highly similar, indicating that scDesign3’s synthetic data well mimicked real data.
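The evaluation described above (train one model on real data and one on synthetic data, then score both on the same held-out data) can be sketched for a single gene. This is a minimal illustration, not the study's pipeline: a one-nearest-neighbour predictor stands in for gbm, randomForest and svmRadial, and the spot locations and counts are hypothetical values invented for the example.

```python
import math


def rmse(predicted, observed):
    """Root-mean-squared error between predictions and observations."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(observed))


def nearest_neighbor_predict(train_locations, train_values, query_locations):
    """Predict each query spot with the value of the closest training spot.

    A deliberately simple stand-in for the prediction algorithms in the legend.
    """
    predictions = []
    for qx, qy in query_locations:
        closest = min(
            range(len(train_locations)),
            key=lambda i: (train_locations[i][0] - qx) ** 2
            + (train_locations[i][1] - qy) ** 2,
        )
        predictions.append(train_values[closest])
    return predictions


# Hypothetical training spots with real and synthetic log(count + 1) values.
train_locations = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
real_train = [math.log1p(c) for c in [0, 2, 5, 9]]
synthetic_train = [math.log1p(c) for c in [0, 3, 4, 10]]

# Hypothetical held-out (testing) spots and their observed values.
test_locations = [(0.1, 0.1), (0.9, 0.2), (0.2, 0.8), (0.9, 0.9)]
test_values = [math.log1p(c) for c in [1, 5, 2, 8]]

# One pair of RMSEs per gene: model trained on real vs. synthetic data.
rmse_real = rmse(
    nearest_neighbor_predict(train_locations, real_train, test_locations),
    test_values,
)
rmse_synthetic = rmse(
    nearest_neighbor_predict(train_locations, synthetic_train, test_locations),
    test_values,
)
print("RMSE (trained on real):", rmse_real)
print("RMSE (trained on synthetic):", rmse_synthetic)
```

Repeating this for every gene gives one dot per gene in a panel, and dots lying close to the diagonal correspond to similar RMSEs for the two training sources.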
Extended Data Fig. 6 The effect of K on scDesign3’s simulation of spatial transcriptomics data (dataset ACINAR in Supplementary Table 2).
The rows represent three cancer-related genes; column 1 represents real test data; columns 2–8 represent scDesign3’s synthetic data generated using varying K, the input basis number. A large Pearson correlation coefficient (r) represents similar spatial patterns in synthetic and test data. The effective degrees of freedom (edf) represents the wiggliness of the fitted surface. With a larger K, scDesign3 can fit more complex patterns. The overfitting issue is accounted for by the automatic smoothness estimation (ref. 39): when K is sufficiently large, edf (model complexity) and r (model goodness-of-fit) both become stable.
Extended Data Fig. 7 scDesign3 simulated spot-resolution spatial transcriptomics data for benchmarking cell-type deconvolution algorithms (datasets MOB-SP and MOB-SC in Supplementary Table 2).
a, scDesign3’s synthetic spot-resolution data well mimicked real data (top row), showing similar expression patterns for four cell-type marker genes (columns). scDesign3 used three steps to generate the spot-resolution data. Step 1: every gene’s estimated mean expression level at each spot (as a smooth function of spot location) by scDesign3. Step 2: every gene’s predicted expression level at each spot from CIBERSORT’s estimated cell-type proportions at the spot (considered as the ‘true proportions’) and the gene’s cell-type-specific expression levels (from the reference scRNA-seq data). Step 3: every gene’s simulated expression level at each spot by scDesign3 (from the true cell-type proportions at the spot and scDesign3’s synthetic scRNA-seq data). b, Using scDesign3 synthetic data, we benchmarked three spatial cell-type deconvolution algorithms (CARD6, RCTD7, and SPOTlight8). For each of the four cell types (columns), we used two metrics, Pearson correlation (r) and root-mean-square error (RMSE), to compare the proportions estimated by each deconvolution algorithm (rows 2–4) to the true proportions (top row). Large r values represent similar spatial patterns of proportions, while small RMSE values represent similar values of proportions. Although all three algorithms well captured the spatial patterns of each cell type’s proportions (evidenced by large r values), CARD and RCTD outperformed SPOTlight by estimating cell-type proportions more accurately (evidenced by smaller RMSE values).
Extended Data Fig. 8 scDesign3 simulated scATAC-seq data (human PBMCs; dataset ATAC in Supplementary Table 2).
a, Distributions of eight summary statistics in the test data and the synthetic data generated by scDesign3 using cell type labels. Each number on top of a violin plot (the distribution of a summary statistic in a synthetic dataset) is the Kolmogorov-Smirnov (KS) distance between the synthetic data distribution (indicated by that violin plot) and the test data distribution. A smaller number indicates better agreement between the synthetic data and the test data in terms of that summary statistic’s distribution. b, Heatmaps of the peak-peak correlation matrices in the test data and the synthetic data generated by scDesign3. Pearson’s correlation coefficient r measures the similarity between two correlation matrices, one from the test data and the other from the synthetic data. c, PCA visualization (top two PCs) of the test data and the synthetic data generated by scDesign3. Cell types are labeled by colors. An mLISI value close to 2 means that the synthetic data resemble the test data well in the low-dimensional space. d, UMAP visualization of the test data and the synthetic data generated by scDesign3.
Extended Data Fig. 9 scDesign3 simulated CITE-seq data (human PBMCs; dataset CITE in Supplementary Table 2).
a, Distributions of eight summary statistics in the test data and the synthetic data generated by scDesign3. The CITE-seq dataset contains simultaneous measurements of each cell’s gene expression and surface protein abundance captured by Antibody-Derived Tags (ADTs). Each number on top of a violin plot (the distribution of a summary statistic in a synthetic dataset) is the Kolmogorov-Smirnov (KS) distance between the synthetic data distribution (indicated by that violin plot) and the test data distribution. A smaller number indicates better agreement between the synthetic data and the test data in terms of that summary statistic’s distribution. b, Heatmaps of the gene and protein correlation matrices (10 proteins with names starting with ‘ADT’ and their corresponding genes) in the test data and the synthetic data generated by scDesign3. Pearson’s correlation coefficient r measures the similarity between two correlation matrices, one from the test data and the other from the synthetic data. scDesign3 preserved the correlations between the RNA and protein expression levels of the 10 surface proteins. c, PCA visualization (top two PCs) of the test data and the synthetic data generated by scDesign3. Cell types are labeled by colors. An mLISI value close to 2 means that the synthetic data resemble the real data well in the low-dimensional space. d, UMAP visualization of the test data and the synthetic data generated by scDesign3.
Extended Data Fig. 10 scDesign3 provides unsupervised measures of the goodness-of-fit of pseudotime, clusters, and inferred spatial locations.
For visual clarity, we plot the relative BIC or AIC (rBIC or rAIC) by re-scaling scDesign3’s marginal BIC or AIC to [0, 1]. a, The scDesign3 rBIC (unsupervised) is negatively correlated with the R² (supervised). Each R² was calculated between the set of perturbed or inferred pseudotimes and the set of true pseudotimes in each of the eight datasets (the column names). The P value is from the one-sided test of Spearman’s rank correlation ρ. The true pseudotime is the ground truth used for generating the synthetic data. b, Comparison of the scDesign3 rBIC and the Clustering Deviation Index (CDI; ref. 33) rBIC (rescaled to [0, 1]). The color scale shows the number of clusters, and the shapes represent clustering algorithms. We found the scDesign3 rBIC (unsupervised) negatively correlated with the ARI (supervised). The P value is from the one-sided test of Spearman’s rank correlation ρ. We also found the scDesign3 rBIC to perform better than or similarly to the CDI on six out of the eight datasets (the column names). c, The scDesign3 rAIC (unsupervised) is negatively correlated with the mean cosine similarity (supervised). The mean cosine similarity was calculated between the set of perturbed or inferred locations and the set of true locations in each of the two spatial datasets (the column names). The P value is from the one-sided test of Spearman’s rank correlation ρ. The true locations are the ground truth used for generating the semi-synthetic data. Due to the high complexity of spatial patterns, the scDesign3 rAIC (left) outperformed the scDesign3 rBIC (right) because it penalizes model complexity less.
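The relationship summarized in panel a can be illustrated with a short sketch that rescales BIC values to [0, 1] and computes Spearman's rank correlation against a supervised score. Two points are assumptions: the legend does not state how the rescaling is done, so min-max scaling is used here, and the BIC and R² values are hypothetical numbers invented for the example. All function names are illustrative.

```python
import math


def rescale_unit(values):
    """Min-max rescaling to [0, 1] (the rescaling method is an assumption)."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("values must not all be equal")
    return [(v - low) / (high - low) for v in values]


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical marginal BICs for four candidate pseudotimes, and the R² of
# each candidate against the true pseudotime.
bic_values = [1200.0, 1350.0, 1500.0, 1800.0]
r_squared = [0.95, 0.80, 0.60, 0.20]

relative_bic = rescale_unit(bic_values)
rho = spearman(relative_bic, r_squared)
print("rBIC:", relative_bic)
print("Spearman rho:", rho)
```

Because rescaling preserves the ordering of the BIC values, it changes only the plotting scale and leaves the rank correlation unchanged; a strongly negative ρ means that a lower rBIC goes with a higher supervised score.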
Supplementary information
Supplementary Information
Supplementary Methods, Figs. 1–5 and Tables 1–5.
About this article
Cite this article
Song, D., Wang, Q., Yan, G. et al. scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics. Nat Biotechnol 42, 247–252 (2024). https://doi.org/10.1038/s41587-023-01772-1
DOI: https://doi.org/10.1038/s41587-023-01772-1