GitHub - gcyuan/rescue: Bootstrap imputation for scRNAseq data

README

This package provides a bootstrap imputation method for dropout events in scRNAseq data published here.

News

Jul. 12, 2020

Version 1.0.3.
Include k neighbors as a parameter for NN network.
Bug fixes.
Independent of scRNAseq pipeline.
Informative genes may be determined using the computeHVG function (also the default) or any other package (e.g. Seurat, scran) and indicated with select_genes.

Requirements

R (>= 3.4)
Python (>= 3.0)

The SNN clustering step still uses the Louvain Algorithm but now borrows from the implementation in the Giotto pipeline.

Installation

Install the rescue package using devtools.

install.packages("devtools", repos="http://cran.rstudio.com/")
library(devtools)
devtools::install_github("seasamgo/rescue")
library(rescue)

Required python modules

pandas
networkx
community (from python-louvain)

Automatic installation

The python modules will be installed automatically in a miniconda environment when installing rescue. However, it will ask you whether you want to install them and you can opt out and go for a manual installation if that is preferred.

Manual installation

Install with pip

pip install pandas
pip install networkx
pip install python-louvain

If you chose the manual installation and have multiple python versions installed, you may preemptively force the reticulate package to use the desired version by specifying the path to the python version you want to use. This can be done using the python_path parameter within the bootstrapImputation function or directly set at the beginning of your script. For example:

reticulate::use_python('~/anaconda2/bin/python', required = T)

Method

bootstrapImputation takes a log-normalized expression matrix and returns a list containing the imputed and original matrices.

bootstrapImputation(
  expression_matrix,                  # expression matrix
  select_cells = NULL,                # subset cells
  select_genes = NULL,                # informative genes
  proportion_genes = 0.6,             # proportion of genes to sample
  log_transformed = TRUE,             # whether expression matrix is log-transformed
  log_base = exp(1),                  # log base of log-transformation
  bootstrap_samples = 100,            # number of samples
  number_pcs = 8,                     # number of PC's to consider
  k_neighbors = 30,                   # number of neighbors for NN network
  snn_resolution = 0.9,               # clustering resolution
  impute_index = NULL,                # specify counts to impute, defaults to zero values
  use_mclapply = FALSE,               # run in parallel
  cores = 2,                          # number of parallel cores
  return_individual_results = FALSE,  # return sample means
  python_path = NULL,                 # path to the python version to use, defaults to default path
  verbose = FALSE                     # print progress to console
  )

Similar cells are determined with shared nearest neighbors clustering upon the principal components of informative gene expression (e.g. highly variable or differentially expressed genes). The names of these informative genes may be indicated with select_genes, which defaults to the most highly variable. For more, please view the help files.

Example

To illustrate, we’ll need the Splatter package to simulate some scRNAseq data.

install.packages("BiocManager", repos="http://cran.rstudio.com/")
BiocManager::install("splatter")
library(splatter)

We’ll consider a hypothetical example of 500 cells and 10,000 genes containing five distinct cell types of near equal size, then introduce some dropout events.

params <- splatter::newSplatParams(
  nGenes = 1e4,
  batchCells = 500,
  group.prob = rep(.2, 5),
  de.prob = .05,
  dropout.mid = rep(0, 5),
  dropout.shape = rep(-.5, 5),
  dropout.type = 'group',
  seed = 940
  )
splat <- splatter::splatSimulate(params = params, method = 'groups')
cell_types <- SummarizedExperiment::colData(splat)$Group
cell_types <- as.factor(gsub('Group', 'Cell type ', cell_types))

For visualization purposes we’ll demonstrate using the Seurat pipeline, as with our published work but now v3. However, any pipeline which fits the needs of your downstream analysis will do.

install.packages("Seurat", repos="http://cran.rstudio.com/")
library(Seurat)

First, we should remove genes that lost all counts to dropout.

counts_true <- SummarizedExperiment::assays(splat)$TrueCounts
counts_dropout <- SummarizedExperiment::assays(splat)$counts
comparable_genes <- rowSums(counts_dropout) != 0

Next, we normalize and scale the data.

expression_true <- Seurat::CreateSeuratObject(counts = counts_true)
expression_true <- Seurat::NormalizeData(expression_true)
expression_true <- Seurat::ScaleData(expression_true, features = rownames(expression_true))
expression_dropout <- Seurat::CreateSeuratObject(counts = counts_dropout[comparable_genes, ])
expression_dropout <- Seurat::NormalizeData(expression_dropout)
expression_dropout <- Seurat::ScaleData(expression_dropout, features = rownames(expression_dropout))

The last step is dimension reduction with PCA and then visualization with t-SNE.

expression_true <- Seurat::RunPCA(expression_true, features = rownames(expression_true), verbose = FALSE)
expression_true <- Seurat::SetIdent(expression_true, value = cell_types)
expression_true <- Seurat::RunTSNE(expression_true)
Seurat::DimPlot(expression_true, reduction = "tsne")

expression_dropout <- Seurat::RunPCA(expression_dropout, features = rownames(expression_dropout), verbose = FALSE)
expression_dropout <- Seurat::SetIdent(expression_dropout, value = cell_types)
expression_dropout <- Seurat::RunTSNE(expression_dropout)
Seurat::DimPlot(ex
6080
pression_dropout, reduction = "tsne")

It’s clear that dropout has distorted our evaluation of the data by cell type as compared to what we should see with the full set of counts. Now let’s impute zero counts to recover missing expression values and reevaluate.

impute <- rescue::bootstrapImputation(expression_matrix = expression_dropout@assays$RNA@data) # python_path can be set here
expression_imputed <- Seurat::CreateSeuratObject(counts = impute$final_imputation)
expression_imputed <- Seurat::ScaleData(expression_imputed)
expression_imputed <- Seurat::RunPCA(expression_imputed, features = rownames(expression_imputed), verbose = FALSE)
expression_imputed <- Seurat::SetIdent(expression_imputed, value = cell_types)
expression_imputed <- Seurat::RunTSNE(expression_imputed)
Seurat::DimPlot(expression_imputed, reduction = "tsne")

The recovery of missing expression values due to dropout events allows us to more accurately distinguish cell types with basic data visualization techniques in this simulated example.

References

Tracy, S., Dries, R., Yuan, G. C. (2019). RESCUE: imputing dropout events in single-cell RNA-sequencing data. BMC Bioinformatics, 20(388). doi: https://doi.org/10.1186/s12859-019-2977-0.

Name		Name	Last commit message	Last commit date
Latest commit History 45 Commits
R		R
inst		inst
man		man
.DS_Store		.DS_Store
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
NAMESPACE		NAMESPACE
README.Rmd		README.Rmd
README.md		README.md
rescue.Rproj		rescue.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

README

News

Requirements

Installation

Required python modules

Automatic installation

Manual installation

Method

Example

References

About

Uh oh!

Releases

Packages

Languages

gcyuan/rescue

Folders and files

Latest commit

History

Repository files navigation

README

News

Requirements

Installation

Required python modules

Automatic installation

Manual installation

Method

Example

References

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages