scE2G

This project is licensed under the terms of the MIT license.

The preprint describing scE2G is available here.

Input: Single-cell ATAC-seq or paired ATAC and RNA-seq (mulitome) data per cell cluster

Output: Genomewide enhancer-gene regulatory link predictions per cell cluster

The pipeline consists of the following components:

Compute ABC model predictions for each cell cluster
Generate E2G features from ABC predictions
If running scE2G (Multiome): compute Kendall correlation and/or ARC-E2G score for each cell cluster
Combine 2 & 3 to construct a feature file to be used as input to predictive model
(optional) Train predictive model using CRISPR-validated E-G pairs from K562 dataset
Apply trained model to make predictions by assigning a score to each E-G pair

Set up

Clone the repo and set it up for submodule usage. Note that you will need to use the --recurse-submodules flag when cloning or updating the repo to ensure the submodules are up-to-date.

# clone the top-level repo and initialize submodules recursively
git clone --recurse-submodules https://github.com/EngreitzLab/scE2G.git

# initialize and update nested submodules
cd scE2G
git submodule update --init --recursive

When running for the first time, the conda environments have to be setup. We highly recommend running scE2G using the environment specified in workflow/envs/run_snakemake.yml, which specifies the exact package versions compatible with the pipeline.

conda config --set channel_priority flexible  # Make sure your conda config uses flexible channel packaging to prevent unsatisfiable errors
conda env create -f workflow/envs/run_snakemake.yml
conda activate run_snakemake

Apply model

Before running this workflow, users should perform clustering to define cell clusters through regular single-cell analysis (such as Seurat & Signac). Required input data includes (refer to the example data in the resources/example_chr22_multiome_cluster folder):

Pseudobulk fragment files and their corresponding *.tbi index files in the same directory for each cell cluster
- Must be sorted by coordinates and gzipped. If it is not sorted, you can use the sortBed tool from bedtools: sortBed atac_fragments.unsorted.tsv > atac_fragments.tsv.
- Must have 5 columns (no header) corresponding to chr, start, end, cell_name, read_count (usually just 1)
- The cell_name column should correspond to the cell names in the RNA count matrix. All cells in the RNA matrix must be represented in the fragment file and vice versa.
- To create a .tbi index, use bgzip from HTSlib instead of gzip to compress the fragment file: bgzip atac_fragments.tsv, then generate the corresponding .tbi index file using tabix -p bed atac_fragments.tsv.gz.
For scE2G (Multiome): RNA count matrix (gene x cell) for each cell cluster
- Use unnormalized (raw) counts
- Ensure there are not duplicated gene names
- Accepted formats for RNA matrix:
  1. .csv.gz format
  2. .h5ad or .h5 format (you may need to modify the default conda environment with the anndata version used to create the file)
  3. Directory with sparse matrix files from 10x (matrix.mtx.gz, genes.tsv.gz or features.tsv.gz, and barcodes.tsv.gz). Note that files are read with the argument gene.column = 1.
- By default, gene names in the RNA matrix are mapped to our gene reference file via Ensembl ID using gene annotations from the GENCODE v43 release (included in resources/gencode.v43.chr_patch_hapl_scaff.annotation.gtf.gz). You can modify the annotations to another GENCODE version in in the gene_annotation field of config/config.yaml. For example, CellRanger uses the GENCODE v32 gene annotations by default; this annotation file can be downloaded here.

To configure the pipeline:

Modify config/config.yaml to specify paths for results_dir.
Modify config/config_cell_clusters.tsv to specify the cluster name, ATAC fragment file path, and RNA matrix path. Leave the remaining columns empty.
If running scE2G (ATAC), also leave the RNA matrix path empty, but still include the column.
Specify model directory from those in models/ to be applied

Running the pipeline:

snakemake -j1 --use-conda

This command make take a while the first time you run it, as it needs to build the conda environments. But if it takes more than 1 hour, that's usually a bad sign, meaning that you're not using mamba and/or need more memory to build the environment.

Output will show up in the results/ directory by default, with the structure results/cell_cluster/model_name/encode_e2g_predictions.tsv.gz. The score column to use is E2G.Score.qnorm.

Train model

Important: Only train models for biosamples matching the corresponding CRISPR data (in this case, K562)

Modify config/config_training.yaml with your model and cell_cluster configs

model_config has columns: model, dataset, ABC_directory, feature_table, polynomial (do you want to use polynomial features?) Note that trained models generated using polynomial features cannot directly be used in the Apply model workflow
cell_cluster_config has rows representing each "dataset" in model_config, where each "dataset" must correspond to a "cluster" in cell_cluster_config
- If an ABC_directory is not specified for a dataset, its entry in cell_cluster_config must also contain the required ABC biosample parameters
- TO DO: specify how to generate and formats for Kendall parameters
To apply a trained model, it must contain the following files: 1) model.pkl, 2) feature_table.tsv, 3) score_threshold_.XX , 5) tpm_threshold_YY (YY=0 if ATAC-only model), 4) qnorm_reference.tsv.gz (single column with header E2G.Score that contains raw scores for genomewide predictions)

Running the pipeline:

snakemake -s workflow/Snakefile_training -j1 --use-conda

Output will show up in the results_training/ directory by default

Name		Name	Last commit message	Last commit date
Latest commit History 273 Commits
.circleci		.circleci
ENCODE_rE2G @ 64dcb0c		ENCODE_rE2G @ 64dcb0c
config		config
models		models
resources		resources
tests		tests
workflow		workflow
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE.txt		LICENSE.txt
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

scE2G

Set up

Apply model

Train model

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors 4

Uh oh!

Languages

License

EngreitzLab/scE2G

Folders and files

Latest commit

History

Repository files navigation

scE2G

Set up

Apply model

Train model

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors 4

Uh oh!

Languages

Packages