Bioinformatics. 2022 Mar 31;38(10):2946–2948. doi: 10.1093/bioinformatics/btac190

PyLiger: scalable single-cell multi-omic data integration in Python

Lu Lu 1, Joshua D Welch 2,3
Editor: Anthony Mathelier
PMCID: PMC9306758  PMID: 35561174

Abstract

Motivation

LIGER (Linked Inference of Genomic Experimental Relationships) is a widely used R package for single-cell multi-omic data integration. However, many users prefer to analyze their single-cell datasets in Python, which offers an attractive syntax and highly optimized scientific computing libraries for increased efficiency.

Results

We developed PyLiger, a Python package for integrating single-cell multi-omic datasets. PyLiger offers faster performance than the previous R implementation (2–5× speedup), interoperability with AnnData format, flexible on-disk or in-memory analysis capability and new functionality for gene ontology enrichment analysis. The on-disk capability enables analysis of arbitrarily large single-cell datasets using fixed memory.

Availability and implementation

PyLiger is available on Github at https://github.com/welch-lab/pyliger and on the Python Package Index.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

High-throughput sequencing technologies now enable the measurement of gene expression, DNA methylation, histone modification and chromatin accessibility at the single-cell level. Integration of such single-cell multi-omic datasets is crucial for identifying cell types and cell states across a range of biological settings. Previously, we developed LIGER (Linked Inference of Genomic Experimental Relationships), an R package that employs integrative non-negative matrix factorization to identify shared and dataset-specific factors of cellular variation (Welch et al., 2019). These factors then provide a principled and quantitative definition of cellular identity and how it varies across biological settings.

Many users prefer to analyze their single-cell datasets in Python, which offers an attractive syntax and highly optimized scientific computing libraries for increased efficiency. However, there is a lack of single-cell multi-omic integration tools available in Python. The Seurat v3 (Stuart et al., 2019) anchors algorithm is implemented in R, as is Harmony (Korsunsky et al., 2019). Scanpy (Wolf et al., 2018) offers excellent libraries for single-cell RNA-seq analysis, including batch correction with the BBKNN algorithm, but this approach is not designed for multi-omic integration such as combining scRNA and snATAC from different cells. The scvi-tools (Gayoso et al., 2021) library similarly provides options for scRNA integration but is not designed for integrating different single-cell modalities from different individual cells.

To address these limitations, we developed PyLiger, a Python implementation of LIGER.

2 Results

2.1 Python implementation of LIGER

We translated the complete, established LIGER framework into Python. Key functions include integration of multiple single-cell datasets using integrative non-negative matrix factorization, joint clustering, visualization and differential expression testing (Fig. 1A). We carefully compared the outputs of corresponding R and Python functions to ensure that they are identical to within the limits of numerical precision. The only exceptions are cases in which external packages called by PyLiger, such as UMAP and Leiden, produce slightly different results between R and Python.
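As an illustration of the factorization at the core of this framework, the following is a minimal NumPy sketch of the iNMF objective described in Welch et al. (2019): the sum over datasets of ||E_i - H_i(W + V_i)||_F^2 plus a penalty lambda * ||H_i V_i||_F^2 on the dataset-specific part. The function name `inmf_objective`, the argument layout (cells in rows, genes in columns) and the toy data are assumptions made for this example; this is not PyLiger's API and it only evaluates the objective rather than optimizing it.

```python
import numpy as np


def inmf_objective(E, H, W, V, lam=5.0):
    """Evaluate the iNMF objective for a list of datasets.

    E: list of (cells_i x genes) expression matrices
    H: list of (cells_i x k) cell factor loadings
    W: (k x genes) shared metagenes
    V: list of (k x genes) dataset-specific metagenes
    lam: penalty weight on the dataset-specific term (value chosen for the demo)
    """
    total = 0.0
    for E_i, H_i, V_i in zip(E, H, V):
        # Reconstruction error using shared plus dataset-specific metagenes.
        total += np.linalg.norm(E_i - H_i @ (W + V_i), "fro") ** 2
        # Penalty that discourages large dataset-specific contributions.
        total += lam * np.linalg.norm(H_i @ V_i, "fro") ** 2
    return total


# Toy data (hypothetical): two datasets that share all of their structure.
rng = np.random.default_rng(0)
k, n_genes = 3, 20
W = rng.random((k, n_genes))
H = [rng.random((50, k)), rng.random((80, k))]
V_zero = [np.zeros((k, n_genes)) for _ in H]
E = [H_i @ W for H_i in H]

obj_shared_only = inmf_objective(E, H, W, V_zero)

# Adding a non-zero dataset-specific term to data that has none raises the objective.
V_noise = [0.1 * rng.random((k, n_genes)) for _ in H]
obj_with_specific = inmf_objective(E, H, W, V_noise)
```

In this toy setting the objective is essentially zero when the dataset-specific metagenes are zero, and it increases once spurious dataset-specific terms are introduced, which is the behaviour the penalty is meant to encourage.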

Fig. 1. (A) PyLiger functions include preprocessing, joint matrix factorization, joint clustering, visualization, differential expression testing and GO enrichment analysis. (B) Diagram of how LIGER class member variables are represented in AnnData format

As an additional feature, we embedded new functionality for gene ontology (GO) enrichment analysis within PyLiger. This makes it much easier to formulate hypotheses about the functions of key genes that are differentially expressed across cell types or biological conditions. Specifically, PyLiger incorporates GOATOOLS (Klopfenstein et al., 2018) for GO enrichment testing and GO-Figure! (Reijnders and Waterhouse, 2021) for visualizing enriched GO terms. Given a list of genes of interest, users can run a single PyLiger function to identify the significantly enriched GO terms. The genes of interest may be derived, for example, by finding genes differentially expressed between clusters, genes differentially expressed across datasets within a cluster, or genes with a high contribution to metagene factors. Users may then further visualize the GO terms with semantic similarity scatterplots (Supplementary Fig. 1D). The plotting functions allow full customization of colormap, labels and other plot elements.
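To make the idea of an enrichment test concrete, the sketch below computes a one-sided hypergeometric p-value for a single GO term, which is the kind of over-representation statistic that GO enrichment tools typically report. It uses only the Python standard library; the function name `enrichment_pvalue` and the example counts are hypothetical, it is not the GOATOOLS or PyLiger interface, and it omits the multiple-testing correction that a real analysis would need.

```python
from math import comb


def enrichment_pvalue(n_study_hits, n_study, n_pop_hits, n_pop):
    """Right-tailed hypergeometric p-value P(X >= n_study_hits).

    n_study_hits: genes of interest annotated with the GO term
    n_study:      total genes of interest
    n_pop_hits:   background genes annotated with the GO term
    n_pop:        total background genes
    """
    upper = min(n_study, n_pop_hits)
    total = comb(n_pop, n_study)
    # Sum the probability of observing x annotated genes for every x at
    # least as extreme as the observed count.
    tail = sum(
        comb(n_pop_hits, x) * comb(n_pop - n_pop_hits, n_study - x)
        for x in range(n_study_hits, upper + 1)
    )
    return tail / total


# Hypothetical example: 8 of 40 genes of interest carry a term that
# annotates 200 of 20 000 background genes.
p_enriched = enrichment_pvalue(8, 40, 200, 20000)

# With no annotated genes required, the tail covers every outcome.
p_trivial = enrichment_pvalue(0, 5, 10, 100)
```

A small `p_enriched` indicates that the term is over-represented among the genes of interest relative to the background.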

2.2 PyLiger adapts AnnData format to interoperate with existing packages

We designed the structure of the PyLiger class to interface smoothly with the widely used AnnData format. The AnnData package was initially introduced along with Scanpy, offering a convenient way to store data matrices and annotations together. We store cell factor loading matrices (Hi), shared metagenes (W) and dataset-specific metagenes (Vi) as annotations of the raw matrix (Fig. 1B). The use of AnnData format also facilitates interoperability with existing single-cell analysis tools such as Scanpy and scVelo (Bergen et al., 2020). We follow the Scanpy naming rules for our annotations (UMAP coordinates, for instance) so that each individual AnnData object can be plugged into Scanpy easily.

2.3 Python implementation reduces runtimes

To demonstrate the performance of PyLiger, we tested our functions using a dataset of 100 000 cells sampled from the adult mouse cortex (Saunders et al., 2018). We confirmed that the results from PyLiger are identical (to within numerical precision) to those from the LIGER R package (Supplementary Fig. 1A and B); the only differences arise from external packages called by PyLiger, such as UMAP and Leiden, which in some cases produce slightly different results between R and Python. Moreover, PyLiger functions run 1.5–5 times faster than their R counterparts (Supplementary Fig. 2). In particular, the most time-consuming step, matrix factorization (using in-memory mode iNMF), is ∼3 times faster in Python than in our previous R implementation. This is notable because many of the R functions are implemented using Rcpp, whereas all of PyLiger is implemented in native Python.
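A check of the form "identical to within numerical precision" can be expressed with NumPy as in the sketch below. The helper name `same_within_precision`, the tolerances and the synthetic matrices are assumptions for illustration; in practice the two inputs would be the same matrix exported from the R and Python runs.

```python
import numpy as np


def same_within_precision(a, b, rtol=1e-6, atol=1e-8):
    """Return True if two arrays have the same shape and agree elementwise
    within the given relative and absolute tolerances."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return a.shape == b.shape and bool(np.allclose(a, b, rtol=rtol, atol=atol))


# Stand-ins for a factor matrix produced by two implementations.
rng = np.random.default_rng(0)
reference = rng.random((200, 20))
rounding_noise = reference * (1.0 + 1e-9)   # differences at rounding level
real_difference = reference + 1e-3          # a genuine discrepancy

agrees = same_within_precision(reference, rounding_noise)
differs = same_within_precision(reference, real_difference)
```

Steps that depend on external stochastic packages, such as UMAP or Leiden, would not be expected to pass an elementwise check of this kind.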

2.4 PyLiger scales to arbitrarily large single-cell datasets using fixed memory

PyLiger uses the HDF5 file format for on-demand loading of datasets stored on disk. We found that in AnnData objects, only the raw matrix allows HDF5-based backing, but not other processed matrices stored as layers. Therefore, we store data matrices in a separate HDF5 file while matrix annotations are still stored in AnnData objects. We compared the on-disk mode to the in-memory mode using the same dataset of 100 000 cells. By sacrificing a little processing efficiency (about 2 s on a dataset of 100 000 cells), the on-disk mode functions can process arbitrarily large datasets using fixed memory. Note that the function create_liger in the on-disk mode of PyLiger is slightly slower than its counterpart in the on-disk mode of LIGER because of newly implemented features.
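The principle of on-demand loading can be sketched with `h5py`: a matrix stored on disk is read one block of rows at a time, so peak memory depends on the block size rather than on the number of cells. The dataset name `"counts"`, the dense layout, the block size and the helper `column_sums_on_disk` are assumptions for this example and do not describe PyLiger's actual file layout.

```python
import os
import tempfile

import h5py
import numpy as np


def column_sums_on_disk(path, dataset="counts", block_rows=1000):
    """Compute per-gene totals while holding only one row block in memory."""
    with h5py.File(path, "r") as f:
        dset = f[dataset]
        n_rows, n_cols = dset.shape
        totals = np.zeros(n_cols, dtype=np.float64)
        for start in range(0, n_rows, block_rows):
            # Slicing an h5py dataset reads just these rows from disk.
            block = dset[start:start + block_rows]
            totals += block.sum(axis=0)
    return totals


# Write a toy cells x genes matrix to a temporary HDF5 file, then stream it.
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(5000, 50)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.h5")
    with h5py.File(path, "w") as f:
        # Chunking by row blocks matches the access pattern used above.
        f.create_dataset("counts", data=counts, chunks=(1000, 50))
    totals = column_sums_on_disk(path)
```

Choosing HDF5 chunks that match the row blocks being read avoids touching more of the file than each step needs.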

Moreover, we implemented the online iNMF algorithm (Gao et al., 2021) in combination with the HDF5 file format, providing scalable and efficient data integration as well as significant memory savings. The online iNMF algorithm scales to arbitrarily large numbers of cells but still uses fixed memory and can incorporate new data without recalculating from scratch. To benchmark the performance, we compared online iNMF in PyLiger and LIGER using five datasets of increasing size (ranging from 10 000 to 255 353 cells in total) sampled from the same adult mouse cortex dataset. The PyLiger implementation of online iNMF achieves a 2.3× speedup on average in comparison to its R counterpart (Supplementary Fig. 2B). To demonstrate that the online iNMF algorithm can perform data integration using fixed memory, we tested both the in-memory and the on-disk mode of online iNMF on the same five datasets. As the number of cells increases, the peak memory consumption of the on-disk mode of online iNMF remains constant (Supplementary Fig. 2C).
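The fixed-memory property of online learning can be illustrated with a deliberately simplified example: a single-dataset online NMF that sees one mini-batch of cells at a time and keeps only two small summary matrices, whose sizes depend on the number of factors and genes but not on the number of cells. This is a sketch under stated assumptions and not the online iNMF algorithm of Gao et al. (2021), which additionally models shared and dataset-specific metagenes; the function names, update rules (multiplicative updates for the loadings, projected coordinate updates for the metagenes) and synthetic data are all choices made for this example.

```python
import numpy as np


def online_nmf(batches, n_genes, k, n_inner=20, seed=0, eps=1e-10):
    """Learn non-negative metagenes W (k x genes) from streamed mini-batches."""
    rng = np.random.default_rng(seed)
    W = rng.random((k, n_genes))
    A = np.zeros((k, k))          # running sum of H^T H (fixed size)
    B = np.zeros((k, n_genes))    # running sum of H^T X (fixed size)
    for X in batches:
        # Estimate loadings H for this batch with W held fixed
        # (multiplicative updates keep H non-negative).
        H = np.ones((X.shape[0], k))
        WWt = W @ W.T
        XWt = X @ W.T
        for _ in range(n_inner):
            H *= XWt / (H @ WWt + eps)
        # Fold the batch into the summaries; the batch can then be discarded.
        A += H.T @ H
        B += H.T @ X
        # Update each metagene by projected coordinate descent on the summaries.
        for j in range(k):
            step = (B[j] - A[j] @ W) / max(A[j, j], eps)
            W[j] = np.maximum(0.0, W[j] + step)
    return W, A, B


def synthetic_batches(n_batches, batch_size, n_genes, k, seed=1):
    """Yield non-negative low-rank mini-batches one at a time (toy data)."""
    rng = np.random.default_rng(seed)
    W_true = rng.random((k, n_genes))
    for _ in range(n_batches):
        yield rng.random((batch_size, k)) @ W_true


W_hat, A_stat, B_stat = online_nmf(synthetic_batches(50, 200, 30, 4), n_genes=30, k=4)
```

Because new mini-batches only update the running summaries, additional cells can be incorporated by continuing the loop rather than refitting from scratch.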

3 Conclusion

PyLiger provides an effective way to integrate large-scale single-cell multi-omic datasets. Its Python implementation enables convenient interoperability with other single-cell analysis tools and advanced machine learning and deep learning approaches. Embedded GO enrichment analysis and visualization modules provide a convenient interface for downstream analysis. Furthermore, incorporating online iNMF and the HDF5 file format allows PyLiger to process large numbers of cells quickly and with limited memory.

Funding

This work was supported by R01HG010883 to J.D.W.

Conflict of Interest: none declared.

Supplementary Material

btac190_Supplementary_Data

Contributor Information

Lu Lu, Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.

Joshua D Welch, Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; Department of Computer Science and Engineering, University of Michigan, Ann Arbor, MI, USA.

References

  1. Bergen V. et al. (2020) Generalizing RNA velocity to transient cell states through dynamical modeling. Nat. Biotechnol., 38, 1408–1414.
  2. Gao C. et al. (2021) Iterative single-cell multi-omic integration using online learning. Nat. Biotechnol., 39, 1000–1007.
  3. Gayoso A. et al. (2021) A Python library for probabilistic analysis of single-cell omics data. Nat. Biotechnol., 40, 163–166. https://www.nature.com/articles/s41587-021-01206-w
  4. Klopfenstein D.V. et al. (2018) GOATOOLS: a Python library for gene ontology analyses. Sci. Rep., 8, 10872.
  5. Korsunsky I. et al. (2019) Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods, 16, 1289–1296.
  6. Reijnders M.J.M.F., Waterhouse R.M. (2021) Summary visualizations of gene ontology terms with GO-Figure! Front. Bioinform., 1, 638255.
  7. Saunders A. et al. (2018) Molecular diversity and specializations among the cells of the adult mouse brain. Cell, 174, 1015–1030.
  8. Stuart T. et al. (2019) Comprehensive integration of single-cell data. Cell, 177, 1888–1902.
  9. Welch J.D. et al. (2019) Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell, 177, 1873–1887.
  10. Wolf F. et al. (2018) SCANPY: large-scale single-cell gene expression data analysis. Genome Biol., 19, 15.


Articles from Bioinformatics are provided here courtesy of Oxford University Press
