This repository contains code and data for the heteroresistance detection project published in eBioMedicine.
The pipelines were created using Snakemake v7.32.4. Data analysis was performed using R v4.4.1, and machine learning was performed using tidymodels v1.2.0.
This project contains three pipelines:

1. hybrid assembly of genomes and annotation of resistance genes, direct repeats and IS elements;
2. analysis of HR mutants;
3. core-genome based phylogeny.

The pipelines have been tested on Ubuntu 22.04.5 with conda v23.1.0 and mamba v1.1.0.
Important notes
(1) The hybrid assembly script will not work if your system lacks `libgsl.so.25` and `libcblas.so.3` (both required by bcftools, which in turn is required by medaka=1.6.0). For some reason, conda installs the wrong versions of these libraries when solving the environment. In that case, use the files in `workflow/libs`: copy or link them into the environment's `lib` directory.
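The copy/link step can be sketched as follows. `ENV_LIB` is a placeholder: point it at the real environment's `lib` directory (e.g. `$CONDA_PREFIX/lib`, or the environment Snakemake created under `.snakemake/conda/`); the `touch`ed files below stand in for the bundled libraries for illustration only.

```shell
# Minimal sketch of the library fix. ENV_LIB is a placeholder path:
# point it at the real environment's lib directory before running.
ENV_LIB="/tmp/demo_env/lib"
mkdir -p "$ENV_LIB" workflow/libs
# placeholder files stand in for the bundled libraries in this sketch
touch workflow/libs/libgsl.so.25 workflow/libs/libcblas.so.3
for lib in libgsl.so.25 libcblas.so.3; do
  ln -sf "$PWD/workflow/libs/$lib" "$ENV_LIB/$lib"
done
```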
(2) If you choose to run the pipelines in containers, be prepared for some environments failing to resolve. Everything worked on a 'vanilla' Ubuntu 22.04.5, so try the conda environments first; if they work, there is no need to use containers.
- download the repository:

  ```shell
  git clone https://github.com/andrewgull/HeteroR
  ```

  Navigate to the `HeteroR` directory (all following steps assume that you are inside this directory).
- create a directory for raw reads:

  ```shell
  mkdir -p resources/raw/
  ```
- download the raw reads (NCBI SRA BioProject PRJNA1165464) into this directory. Naming convention: long reads have the `.fastq.gz` extension and paired short reads have the `.fq.gz` extension. By default, reads downloaded from SRA have names of the form "SRR" followed by 8 digits; this project assumes that all read files are named after the in-house scheme, "DA" followed by 6 digits. To determine how the SRR numbers correspond to DA numbers, refer to the table `configs/da_srr_mapping.csv`.
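Renaming downloaded SRR files to DA names can be scripted from the mapping table. A minimal sketch, assuming a two-column `DA,SRR` layout (check the actual header of `configs/da_srr_mapping.csv` before use); the inline mapping file and read file below are tiny stand-ins for illustration:

```shell
# Sketch: rename SRA-style read files to the in-house DA names.
# The mapping file and read file here are stand-ins for illustration.
mkdir -p resources/raw
printf 'DA,SRR\nDA000001,SRR00000001\n' > /tmp/da_srr_mapping.csv
touch resources/raw/SRR00000001.fastq.gz
tail -n +2 /tmp/da_srr_mapping.csv | while IFS=, read -r da srr; do
  src="resources/raw/${srr}.fastq.gz"
  [ -f "$src" ] && mv "$src" "resources/raw/${da}.fastq.gz"
done
```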
- one more piece of data you need is the CARD database. You can download the newest version using

  ```shell
  wget https://card.mcmaster.ca/latest/data
  ```

  and then follow the instructions on the CARD website on how to get the actual database. Alternatively, you can use the version we used for this project, stored in the `localDB.tgz` archive. Just unpack it and make sure it is inside the project's directory, i.e. inside `HeteroR`:

  ```shell
  tar -xf localDB.tgz
  ```

  The corresponding step of the pipeline will find this database and use it for resistance gene identification.
- install conda/mamba and snakemake
- activate the snakemake environment:

  ```shell
  conda activate snakemake
  ```
- run the pipeline using this command:

  ```shell
  # substitute $N with the number of threads you want to use
  snakemake --snakefile workflow/assembly-annotation.smk --use-conda --cores $N
  ```

  note: add `--use-singularity` if you want to run the analysis inside a container (check the Important notes above!).
After the main pipeline has finished, you can run the following four notebooks (not necessarily all of them):

- to generate the features table: `notebooks/modelling/features.qmd` (a pre-compiled table is available in `notebooks/modelling/data/features_strain.csv`)
- for exploratory data analysis: `notebooks/modelling/EDA.qmd` (a pre-compiled HTML file is available in `notebooks/modelling/EDA.html.gz`)
- to run training and validation: `notebooks/modelling/training_and_validation.Rmd` (a pre-compiled HTML file is available in `notebooks/modelling/training_and_validation.html.gz`)
- for comparison and analysis of the models: `notebooks/modelling/models_analysis.Rmd` (a pre-compiled HTML file is available in `notebooks/modelling/models_analysis.html.gz`)
To ensure that you use the same versions of R packages as were used in these notebooks, install the `renv` package and run `renv::restore()` (see the renv documentation for details).
- place the mutant read files in `resources/raw/mutants`. Naming convention: `{parent_strain_name}_[1,2].fq.gz`. To determine the parental strains of the mutants, refer to the file `configs/parent_mutant_srr_mapping.csv`.
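Before starting the pipeline, it can help to verify that every mutant has both mates of its read pair. A minimal sketch; the two files created below are placeholders for illustration only:

```shell
# Sketch: check that each *_1.fq.gz in resources/raw/mutants has a *_2 mate.
mkdir -p resources/raw/mutants
# the two files below are placeholders for illustration
touch resources/raw/mutants/DA000001_1.fq.gz resources/raw/mutants/DA000001_2.fq.gz
missing=0
for f1 in resources/raw/mutants/*_1.fq.gz; do
  f2="${f1%_1.fq.gz}_2.fq.gz"
  [ -f "$f2" ] || { echo "missing mate for $f1"; missing=$((missing + 1)); }
done
echo "$missing mate file(s) missing"
```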
- run the pipeline with the following command:

  ```shell
  # analysis of the HR mutants
  # substitute $N with the number of threads you want to use
  snakemake --snakefile workflow/mutants.smk --use-conda --use-singularity --cores $N
  ```

  remove `--use-singularity` if you want to use only conda environments.
- place the genomes (assemblies) of the 31 reference strains in `results/assemblies_joined/`. The names of these strains are provided in `configs/strains_phylogeny.txt`. Each genome should be placed in its own directory named according to this list of strains, and each assembly file should be named `assembly.fasta` (the same way as with the 474 collection strains from pipeline (1)).
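Setting up the expected directory layout can be scripted from the strain list. A minimal sketch, using an inline two-strain list as a stand-in for `configs/strains_phylogeny.txt`:

```shell
# Sketch: create one directory per reference strain; each should then
# receive its genome as results/assemblies_joined/<strain>/assembly.fasta.
printf 'DA000001\nDA000002\n' > /tmp/strains_phylogeny.txt   # stand-in list
while read -r strain; do
  mkdir -p "results/assemblies_joined/$strain"
done < /tmp/strains_phylogeny.txt
ls results/assemblies_joined
```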
- run the following command:

  ```shell
  # substitute $N with the number of threads you want to use
  snakemake --snakefile workflow/phylogeny.smk --use-conda --use-singularity --cores $N
  ```

  remove `--use-singularity` if you want to use only conda environments.

NB: to find the NCBI accession numbers of the reference strains, refer to the file `configs/reference_strains.csv`.
You can change the reference strain names provided in the strain list to whatever suits you better.
Lists of strain names used in each of the pipelines can be found in the `configs/strains_*.txt` files.
Settings of each software tool used can be found in the `configs/config_*.yaml` files.
Software versions are specified in the yaml files located in `workflow/envs`.
The raw sequencing reads used in this project are available from NCBI's SRA under BioProjects PRJNA1165464 (474 parental strains), PRJNA1083935 and PRJNA1160527 (mutants).
The pre-compiled features table is available in `notebooks/modelling/data/features_strain.csv`.
The final models (trained LLR and GBT) are available in `notebooks/modelling/models`.