8000 GitHub - EngreitzLab/scE2G: Pipeline to run scE2G
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

EngreitzLab/scE2G

Repository files navigation

scE2G

This project is licensed under the terms of the MIT license.

The preprint describing scE2G is available here.

Input: Single-cell ATAC-seq or paired ATAC and RNA-seq (mulitome) data per cell cluster

Output: Genomewide enhancer-gene regulatory link predictions per cell cluster

The pipeline consists of the following components:

  1. Compute ABC model predictions for each cell cluster
  2. Generate E2G features from ABC predictions
  3. If running scE2G (Multiome): compute Kendall correlation and/or ARC-E2G score for each cell cluster
  4. Combine 2 & 3 to construct a feature file to be used as input to predictive model
  5. (optional) Train predictive model using CRISPR-validated E-G pairs from K562 dataset
  6. Apply trained model to make predictions by assigning a score to each E-G pair

Set up

Clone the repo and set it up for submodule usage. Note that you will need to use the --recurse-submodules flag when cloning or updating the repo to ensure the submodules are up-to-date.

# clone the top-level repo and initialize submodules recursively
git clone --recurse-submodules https://github.com/EngreitzLab/scE2G.git

# initialize and update nested submodules
cd scE2G
git submodule update --init --recursive

When running for the first time, the conda environments have to be setup. We highly recommend running scE2G using the environment specified in workflow/envs/run_snakemake.yml, which specifies the exact package versions compatible with the pipeline.

conda config --set channel_priority flexible  # Make sure your conda config uses flexible channel packaging to prevent unsatisfiable errors
conda env create -f workflow/envs/run_snakemake.yml
conda activate run_snakemake

Apply model

Before running this workflow, users should perform clustering to define cell clusters through regular single-cell analysis (such as Seurat & Signac). Required input data includes (refer to the example data in the resources/example_chr22_multiome_cluster folder):

  1. Pseudobulk fragment files and their corresponding *.tbi index files in the same directory for each cell cluster
    • Must be sorted by coordinates and gzipped. If it is not sorted, you can use the sortBed tool from bedtools: sortBed atac_fragments.unsorted.tsv > atac_fragments.tsv.
    • Must have 5 columns (no header) corresponding to chr, start, end, cell_name, read_count (usually just 1)
    • The cell_name column should correspond to the cell names in the RNA count matrix. All cells in the RNA matrix must be represented in the fragment file and vice versa.
    • To create a .tbi index, use bgzip from HTSlib instead of gzip to compress the fragment file: bgzip atac_fragments.tsv, then generate the corresponding .tbi index file using tabix -p bed atac_fragments.tsv.gz.
  2. For scE2G (Multiome): RNA count matrix (gene x cell) for each cell cluster
    • Use unnormalized (raw) counts
    • Ensure there are not duplicated gene names
    • Accepted formats for RNA matrix:
      1. .csv.gz format
      2. .h5ad or .h5 format (you may need to modify the default conda environment with the anndata version used to create the file)
      3. Directory with sparse matrix files from 10x (matrix.mtx.gz, genes.tsv.gz or features.tsv.gz, and barcodes.tsv.gz). Note that files are read with the argument gene.column = 1.
    • By default, gene names in the RNA matrix are mapped to our gene reference file via Ensembl ID using gene annotations from the GENCODE v43 release (included in resources/gencode.v43.chr_patch_hapl_scaff.annotation.gtf.gz). You can modify the annotations to another GENCODE version in in the gene_annotation field of config/config.yaml. For example, CellRanger uses the GENCODE v32 gene annotations by default; this annotation file can be downloaded here.

To configure the pipeline:

  • Modify config/config.yaml to specify paths for results_dir.
  • Modify config/config_cell_clusters.tsv to specify the cluster name, ATAC fragment file path, and RNA matrix path. Leave the remaining columns empty.
  • If running scE2G (ATAC), also leave the RNA matrix path empty, but still include the column.
  • Specify model directory from those in models/ to be applied

Running the pipeline:

snakemake -j1 --use-conda

This command make take a while the first time you run it, as it needs to build the conda environments. But if it takes more than 1 hour, that's usually a bad sign, meaning that you're not using mamba and/or need more memory to build the environment.

Output will show up in the results/ directory by default, with the structure results/cell_cluster/model_name/encode_e2g_predictions.tsv.gz. The score column to use is E2G.Score.qnorm.

Train model

Important: Only train models for biosamples matching the corresponding CRISPR data (in this case, K562)

Modify config/config_training.yaml with your model and cell_cluster configs

  • model_config has columns: model, dataset, ABC_directory, feature_table, polynomial (do you want to use polynomial features?) Note that trained models generated using polynomial features cannot directly be used in the Apply model workflow
  • cell_cluster_config has rows representing each "dataset" in model_config, where each "dataset" must correspond to a "cluster" in cell_cluster_config
    • If an ABC_directory is not specified for a dataset, its entry in cell_cluster_config must also contain the required ABC biosample parameters
    • TO DO: specify how to generate and formats for Kendall parameters
  • To apply a trained model, it must contain the following files: 1) model.pkl, 2) feature_table.tsv, 3) score_threshold_.XX , 5) tpm_threshold_YY (YY=0 if ATAC-only model), 4) qnorm_reference.tsv.gz (single column with header E2G.Score that contains raw scores for genomewide predictions)

Running the pipeline:

snakemake -s workflow/Snakefile_training -j1 --use-conda

Output will show up in the results_training/ directory by default

About

Pipeline to run scE2G

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •  
0