scGALA: Graph Link Prediction Based Cell Alignment for Comprehensive Data Integration
For detailed instructions, comprehensive documentation, and helpful tutorials, please visit:
Step 1: Create a conda environment for scGALA
conda create -n scGALA python=3.10 -y
conda activate scGALA
Step 2: Install PyTorch as described in its official documentation. Choose the platform and accelerator (GPU/CPU) that match your hardware to avoid common dependency issues. Currently, the DGL package requires PyTorch <= 2.4.0.
A note regarding DGL for the required packages PyGCL and PyG
Currently the DGL team maintains two versions: dgl for CPU support and dgl-cu*** for CUDA support. Since pip treats them as different packages, it is hard for PyGCL to check the version requirement of dgl. The PyGCL developers have removed the dependency check for dgl from their setup configuration and require users to install a suitable version themselves. The same applies to the Additional Libraries required by PyG: please install these optional dependencies to match your hardware after installing scGALA.
# PyTorch example; choose the CUDA version accordingly
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
# Install scGALA
pip install scGALA
# Example for DGL and PyG additional dependencies. Please read the note and install them based on your actual hardware.
# DGL
pip install dgl -f https://data.dgl.ai/wheels/torch-2.4/cu121/repo.html
# PyG
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.4.0+cu121.html
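Because of the PyTorch <= 2.4.0 constraint mentioned in Step 2, it can help to check the installed version before installing DGL. The following is a minimal sketch using only the standard library; `parse_release` and `torch_version_ok` are hypothetical helpers written for this illustration and are not part of scGALA.

```python
import re

MAX_TORCH = (2, 4, 0)  # upper bound taken from the installation note

def parse_release(version: str) -> tuple:
    """Return the numeric release part, e.g. '2.4.0+cu121' -> (2, 4, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component (e.g. '2.4') is treated as 0.
    return tuple(int(part or 0) for part in match.groups())

def torch_version_ok(version: str, maximum: tuple = MAX_TORCH) -> bool:
    """True if the release part of `version` does not exceed `maximum`."""
    return parse_release(version) <= maximum

if __name__ == "__main__":
    from importlib import metadata
    try:
        installed = metadata.version("torch")
    except metadata.PackageNotFoundError:
        print("torch is not installed in this environment")
    else:
        status = "OK" if torch_version_ok(installed) else "too new for DGL"
        print(f"torch {installed}: {status}")
```

The local build suffix (such as `+cu121`) is ignored on purpose, since only the release number matters for the constraint.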
For the core function of scGALA, cell alignment, simply run:
from scGALA import get_alignments
# adata1 and adata2 are AnnData objects that you have already loaded
# Get the edge probability matrix in one line
alignment_matrix = get_alignments(adata1=adata1, adata2=adata2)
# Get the anchor indices for the two datasets
anchor_index1, anchor_index2 = alignment_matrix.nonzero()
# The anchor cells can then be obtained by indexing
anchor_cell1 = adata1[anchor_index1]
anchor_cell2 = adata2[anchor_index2]
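To make the index bookkeeping concrete, here is a toy sketch that uses plain Python lists in place of the real matrix. It assumes that rows of the alignment matrix correspond to cells in adata1 and columns to cells in adata2, which the usage above implies but does not state explicitly. `nonzero_pairs` is a hypothetical helper that mimics what `.nonzero()` returns for a 2-D matrix.

```python
def nonzero_pairs(matrix):
    """Return (row_indices, col_indices) of the nonzero entries, in row-major order."""
    rows, cols = [], []
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value != 0:
                rows.append(i)
                cols.append(j)
    return rows, cols

# Toy 3 x 4 alignment matrix: 3 cells in dataset 1, 4 cells in dataset 2.
toy_alignment = [
    [0.0, 0.95, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.92, 0.0, 0.0, 0.97],
]

anchor_index1, anchor_index2 = nonzero_pairs(toy_alignment)
# Position k pairs cell anchor_index1[k] of dataset 1
# with cell anchor_index2[k] of dataset 2.
anchor_pairs = list(zip(anchor_index1, anchor_index2))
print(anchor_pairs)
```

The two index lists always have the same length, so the anchor cells selected from each dataset line up pair by pair.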
We also provide convenient APIs for enhancing Seurat-based anchors, imputing spatial transcriptomics, generating cross-modality data, and other useful features. Please refer to Tutorials and APIs for detailed walkthroughs.
All example data used in the Tutorials can be found in Figshare. The data used in the batch correction tutorial can be found in Figshare.
scGALA is designed to easily integrate into existing methods that employ cell alignments. The integration can be done in two modes: Module Replacement or External Reference, depending on the working strategy of the target method.
For methods that are under development or that run end-to-end, Module Replacement is the strategy to choose. Identify the Cell Alignment module (keywords to look for: MNN, Alignment, CCA, Anchor, Correspondence) and replace it with scGALA as shown in the Usage.
Tutorial with INSCT as an example: Module Replacement Tutorial Based On INSCT (Batch Correction). We present a comparison experiment between scGALA-enhanced INSCT Supervised and the original INSCT Supervised. To facilitate evaluation, we use scIB to compute the core batch correction metrics.
For methods with clearly separated procedure steps, we recommend the External Reference strategy, as it requires the least effort. In this mode, we do not replace the alignment module; instead, we enhance the intermediate cell alignment results produced by the original method.
Tutorial with Seurat as an example: External Reference Tutorial Based On Seurat (Label Transfer). We present a comparison experiment between scGALA-enhanced Seurat and the original Seurat. We provide APIs to efficiently enhance Seurat-based anchors and compute the anchor scores required by Seurat.
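In the External Reference mode the intermediate anchors have to travel from R to Python. The sketch below shows one way to read them, under the assumption that the Seurat anchors were exported to a CSV file with `cell1`, `cell2` and `score` columns holding 1-based R indices; the actual file layout depends on how you export it. `load_anchor_pairs` is a hypothetical helper, not a scGALA function.

```python
import csv
import io

def load_anchor_pairs(handle, one_based=True):
    """Read (cell1, cell2) anchor pairs from CSV and return 0-based index pairs."""
    # R indices start at 1, Python indices at 0.
    offset = 1 if one_based else 0
    reader = csv.DictReader(handle)
    return [(int(row["cell1"]) - offset, int(row["cell2"]) - offset) for row in reader]

# Inline stand-in for a CSV file exported from R (assumed layout).
example_csv = io.StringIO("cell1,cell2,score\n1,3,0.81\n2,1,0.40\n5,2,0.93\n")
pairs = load_anchor_pairs(example_csv)
print(pairs)
```

Converting to 0-based indices once, at load time, avoids off-by-one errors when the pairs are later used to index Python objects.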
More tutorials are provided to demonstrate Multiomics Integration based on scGALA-enhanced Seurat and Spatial Alignment based on scGALA-enhanced STAligner.
scGALA introduces a multiplet-omics integration strategy that bridges disjoint doublet datasets, such as RNA-ATAC and RNA-ADT, to computationally construct a triplet-omics dataset (RNA-ATAC-ADT), thus bypassing the need for specialized triple-modal sequencing protocols while maintaining coherence across modalities.
Tutorial: Multiplet-omics Integration with scGALA. We demonstrate how scGALA can be used for multiplet-omics integration, specifically integrating RNA+ATAC and RNA+ADT datasets through their shared RNA modality.
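The bridging idea can be pictured with a toy sketch: once cells of the two doublet datasets are aligned through their RNA profiles, each aligned pair contributes one combined record. This is only an illustration of the concept, not scGALA's implementation; `build_triplets` is a hypothetical helper, and taking the RNA profile from the RNA+ATAC dataset is an arbitrary choice made here.

```python
def build_triplets(anchor_pairs, rna_a, atac_a, adt_b):
    """Return one combined RNA+ATAC+ADT record per aligned cell pair.

    anchor_pairs: (i, j) pairs aligning cell i of the RNA+ATAC dataset
                  with cell j of the RNA+ADT dataset via their RNA profiles.
    """
    return [
        {"rna": rna_a[i], "atac": atac_a[i], "adt": adt_b[j], "source_cells": (i, j)}
        for i, j in anchor_pairs
    ]

rna_a = [[1.0, 0.0], [0.0, 2.0]]   # RNA of dataset A (RNA+ATAC)
atac_a = [[1, 0, 1], [0, 1, 1]]    # ATAC of dataset A
adt_b = [[5.0], [7.0], [9.0]]      # ADT of dataset B (RNA+ADT)
anchor_pairs = [(0, 2), (1, 0)]    # toy alignments, dataset A cell -> dataset B cell

triplets = build_triplets(anchor_pairs, rna_a, atac_a, adt_b)
for record in triplets:
    print(record)
```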
scGALA enables cross-modality data generation through a specialized Graph Attention Network framework. This allows for predicting RNA expression profiles from chromatin accessibility (ATAC-seq) data, effectively creating multimodal profiles from single-modality measurements.
Tutorial: Cross-modality Imputation with scGALA. We demonstrate how to use scGALA to generate gene expression (RNA) profiles from ATAC-seq data using cell-cell alignments as guiding information.
scGALA offers functionality to impute spatial transcriptomics data with the help of a reference scRNA dataset. This addresses a major limitation of spatial technologies, which typically measure only a few hundred genes compared to thousands in scRNA-seq.
Tutorial: Spatial Transcriptomics Imputation with scGALA. We show how to enhance spatially resolved transcriptomics by imputing unmeasured genes using a reference scRNA-seq dataset while preserving spatial context.
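As a point of comparison, the simplest way to use cell alignments for imputation is an alignment-weighted average of reference cells. The sketch below shows that baseline only; it is not the graph neural network that scGALA uses, and `impute_from_reference` is a hypothetical helper introduced for illustration.

```python
def impute_from_reference(weights, reference_expression):
    """Impute each spatial cell as the weight-normalized average of reference cells.

    weights[i][j]: alignment score between spatial cell i and reference cell j.
    reference_expression[j]: expression vector of reference cell j.
    Spatial cells with no aligned reference cell get None.
    """
    n_genes = len(reference_expression[0])
    imputed = []
    for row in weights:
        total = sum(row)
        if total == 0:
            imputed.append(None)
            continue
        profile = [
            sum(w * reference_expression[j][g] for j, w in enumerate(row)) / total
            for g in range(n_genes)
        ]
        imputed.append(profile)
    return imputed

weights = [[1.0, 0.0], [0.5, 0.5], [0.0, 0.0]]  # 3 spatial cells x 2 reference cells
reference = [[2.0, 0.0], [4.0, 6.0]]            # 2 reference cells x 2 genes
imputed = impute_from_reference(weights, reference)
print(imputed)
```

Returning None for unaligned cells keeps them distinguishable from cells whose imputed expression is genuinely zero.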
scGALA offers a comprehensive set of functions for various single-cell data integration tasks. Below are the key APIs organized by their purpose.
The main function for cell alignment between two datasets.
Key Parameters:
adata1, adata2: AnnData objects containing the datasets to align
out_dim: Dimension of latent features (default: 32)
k: Number of neighbors for initial MNN search (default: 20)
min_value: Minimum alignment score threshold (default: 0.9)
lamb: Hyperparameter for score-based alignment (default: 0.3)
spatial: Whether to use spatial information in alignment (default: False)
Returns: Matrix of alignment probabilities between cells in the two datasets
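The effect of a score threshold such as min_value can be illustrated on a toy probability matrix. This is a sketch under the assumption that entries below the threshold are dropped and entries equal to it are kept; the README does not specify how scGALA treats the boundary. `threshold_alignments` is a hypothetical helper.

```python
def threshold_alignments(prob_matrix, min_value=0.9):
    """Zero out entries below min_value, keeping only confident alignments."""
    return [[p if p >= min_value else 0.0 for p in row] for row in prob_matrix]

def count_alignments(matrix):
    """Number of nonzero entries, i.e. retained cell pairs."""
    return sum(1 for row in matrix for p in row if p != 0)

probabilities = [
    [0.95, 0.20],
    [0.89, 0.90],
]

strict = threshold_alignments(probabilities, min_value=0.9)
relaxed = threshold_alignments(probabilities, min_value=0.8)
print(count_alignments(strict), count_alignments(relaxed))
```

A higher threshold yields fewer but more confident anchors; a lower one trades confidence for coverage.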
Enhanced mutual nearest neighbors finding with graph learning.
Key Parameters:
data1, data2: Input datasets
k1, k2: Number of neighbors to consider in each dataset
Returns: Lists of mutual indices between datasets
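For reference, the classical mutual nearest neighbors search that this function builds on can be sketched in a few lines. This is the plain MNN procedure without the graph learning step, written with the standard library only; the function names and the exact roles of k1 and k2 here are assumptions for illustration and may differ from scGALA's conventions.

```python
import math

def k_nearest(query, targets, k):
    """Indices of the k targets closest to `query` (Euclidean distance)."""
    order = sorted(range(len(targets)), key=lambda j: math.dist(query, targets[j]))
    return set(order[:k])

def mutual_nearest_neighbors(data1, data2, k1=1, k2=1):
    """Return parallel index lists (mnn1, mnn2) of mutual nearest neighbor pairs.

    A pair (i, j) is kept when j is among the k2 nearest cells of data2 to data1[i]
    and i is among the k1 nearest cells of data1 to data2[j].
    """
    nn_of_1 = [k_nearest(p, data2, k2) for p in data1]
    nn_of_2 = [k_nearest(q, data1, k1) for q in data2]
    mnn1, mnn2 = [], []
    for i, neighbors in enumerate(nn_of_1):
        for j in sorted(neighbors):
            if i in nn_of_2[j]:
                mnn1.append(i)
                mnn2.append(j)
    return mnn1, mnn2

data1 = [(0.0, 0.0), (10.0, 10.0), (5.0, 0.0)]
data2 = [(0.1, 0.0), (10.0, 9.5)]
mnn1, mnn2 = mutual_nearest_neighbors(data1, data2, k1=1, k2=1)
print(list(zip(mnn1, mnn2)))
```

In the toy data, the third cell of data1 has a nearest neighbor in data2, but that neighbor prefers another cell, so the pair is not mutual and is dropped.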
Enhance Seurat anchors using scGALA's graph-based alignment.
Key Parameters:
anchors_ori: Path to CSV file with original anchors
adata1, adata2: Paths to or AnnData objects for the datasets
min_value: Minimum alignment score threshold (default: 0.8)
lamb: Hyperparameter for anchor refinement (default: 0.3)
Returns: Enhanced alignment matrix
Calculate anchor scores for pairs of cells, useful for downstream integration tasks.
Key Parameters:
adata1, adata2: AnnData objects for the datasets
mnn1, mnn2: Lists of indices representing aligned cell pairs
Returns: Array of anchor scores
Replace TNN (INSCT) alignment with scGALA-enhanced alignment.
Key Parameters:
ds1, ds2: Input datasets
names1, names2: Cell names for each dataset
knn: Number of neighbors for alignment (default: 20)
min_value: Minimum alignment score threshold (default: 0.8)
Returns: Aligned indices between datasets
Enhanced alignment for Scanorama integration method.
Key Parameters:
data1, data2: Input datasets
matches: Optional pre-computed matches
Returns: Aligned indices between datasets
Enhanced alignment for scDML batch correction method.
Key Parameters:
ds1, ds2: Input datasets
names1, names2: Cell names for each dataset
knn: Number of neighbors (default: 20)
Returns: Aligned indices between datasets
Spatial-aware version of mnn_tnn that incorporates spatial coordinates.
Key Parameters:
ds1, ds2: Input expression datasets
spatial1, spatial2: Spatial coordinate information
names1, names2: Cell names for each dataset
min_value: Minimum alignment score threshold (default: 0.9)
Returns: Aligned indices between spatial datasets
Neural network model for imputing gene expression in spatial data.
Key Parameters in constructor:
num_features: Number of input features
n_matching_genes: Number of genes shared between datasets
hidden_channels: Size of hidden layers
num_layers: Number of GNN layers (default: 3)
layer_type: Type of GNN layer to use (default: 'GAT')
Usage: Used through MyDataModule_OneStage in the spatial imputation workflow
Split datasets with controlled imbalance for robust testing.
Key Parameters:
adata: Input AnnData object
train_ratio: Overall ratio for training set (default: 0.7)
group_key: Column in adata.obs for splitting (default: 'cell_type')
Returns: Two AnnData objects (train and test)
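The role of train_ratio and group_key can be illustrated with a plain per-group split over cell labels. This sketch does not reproduce the controlled-imbalance logic of the scGALA function, whose details are not given here; it only shows a stratified split in which every group contributes roughly train_ratio of its cells to the training set. `split_by_group` is a hypothetical helper.

```python
import random
from collections import defaultdict

def split_by_group(labels, train_ratio=0.7, seed=0):
    """Split cell indices into train/test, applying train_ratio within each group."""
    by_group = defaultdict(list)
    for index, label in enumerate(labels):
        by_group[label].append(index)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for label in sorted(by_group):
        members = by_group[label][:]
        rng.shuffle(members)
        cut = round(len(members) * train_ratio)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return sorted(train), sorted(test)

# Stand-in for a 'cell_type' column with two groups of 10 cells each.
cell_types = ["T cell"] * 10 + ["B cell"] * 10
train_idx, test_idx = split_by_group(cell_types, train_ratio=0.7)
print(len(train_idx), len(test_idx))
```

The returned index lists could then be used to subset an AnnData object into the two parts.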
Add synthetic batch effects for benchmarking integration methods.
Key Parameters:
data: Input data matrix
batch_effect_strength: Strength of batch effect (default: 0.3)
noise_strength: Strength of random noise (default: 0.3)
Returns: Data with simulated batch effects
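One common way to simulate a batch effect is a systematic per-gene shift plus independent random noise. The sketch below follows that recipe as an assumption; the actual model used by the scGALA function is not described here and may differ. `add_batch_effect` is a hypothetical helper that works on a cells x genes list of lists.

```python
import random

def add_batch_effect(data, batch_effect_strength=0.3, noise_strength=0.3, seed=0):
    """Return a copy of `data` (cells x genes) with a per-gene shift plus per-entry noise."""
    rng = random.Random(seed)
    n_genes = len(data[0])
    # One systematic offset per gene, shared by every cell in the simulated batch.
    gene_shift = [rng.gauss(0.0, batch_effect_strength) for _ in range(n_genes)]
    return [
        [value + gene_shift[g] + rng.gauss(0.0, noise_strength) for g, value in enumerate(row)]
        for row in data
    ]

expression = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
shifted = add_batch_effect(expression, batch_effect_strength=0.3, noise_strength=0.3)
print(shifted)
```

Setting noise_strength to 0 leaves only the shared per-gene shift, which makes it easy to check how well an integration method removes a purely systematic effect.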