This repository contains code and data for the heteroresistance detection project published in eBioMedicine.
The pipelines were created using Snakemake v7.32.4. Data analysis was performed using R v4.4.1, and machine learning was performed using tidymodels v1.2.0.
This project contains three pipelines:

1. hybrid assembly of genomes and annotation of resistance genes, direct repeats and IS elements;
2. analysis of HR mutants;
3. core-genome based phylogeny.

The pipelines have been tested on Ubuntu 22.04.5 with conda v23.1.0 and mamba v1.1.0.
Important notes
(1) The hybrid assembly script will not work if your system lacks `libgsl.so.25` and `libcblas.so.3` (both required by bcftools, which in turn is required by medaka=1.6.0). For some reason, conda installs the wrong versions of these libraries when solving the environment. In that case, use the files in `workflow/libs`: copy or link them into the environment's `lib` directory.
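The copy/link step can be sketched as follows. `ENV_LIB` is a placeholder: point it at the real environment's `lib` directory (e.g. `$CONDA_PREFIX/lib`, or the environment Snakemake created under `.snakemake/conda/`); the `touch`ed files below stand in for the bundled libraries for illustration only.

```shell
# Minimal sketch of the library fix. ENV_LIB is a placeholder path:
# point it at the real environment's lib directory before running.
ENV_LIB="/tmp/demo_env/lib"
mkdir -p "$ENV_LIB" workflow/libs
# placeholder files stand in for the bundled libraries in this sketch
touch workflow/libs/libgsl.so.25 workflow/libs/libcblas.so.3
for lib in libgsl.so.25 libcblas.so.3; do
  ln -sf "$PWD/workflow/libs/$lib" "$ENV_LIB/$lib"
done
```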
(2) If you choose to run the pipelines in containers, be prepared for some environments failing to resolve. Everything worked on a 'vanilla' Ubuntu 22.04.5, so try the conda environments first; if they work, there is no need to use containers.
- download the repository:

  ```shell
  git clone https://github.com/andrewgull/HeteroR
  ```

  Navigate to the `HeteroR` directory (all following steps assume that you are inside this directory).
- create a directory for raw reads:

  ```shell
  mkdir -p resources/raw/
  ```
- download the raw reads (NCBI SRA BioProject PRJNA1165464) into this directory. Naming convention: long reads have the `.fastq.gz` extension and paired short reads have the `.fq.gz` extension. By default, reads downloaded from SRA have names of the form "SRR" followed by 8 digits; this project assumes that all read files are named after the in-house scheme, "DA" followed by 6 digits. To determine how the SRR numbers correspond to DA numbers, refer to the table `configs/da_srr_mapping.csv`.
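Renaming downloaded SRR files to DA names can be scripted from the mapping table. A minimal sketch, assuming a two-column `DA,SRR` layout (check the actual header of `configs/da_srr_mapping.csv` before use); the inline mapping file and read file below are tiny stand-ins for illustration:

```shell
# Sketch: rename SRA-style read files to the in-house DA names.
# The mapping file and read file here are stand-ins for illustration.
mkdir -p resources/raw
printf 'DA,SRR\nDA000001,SRR00000001\n' > /tmp/da_srr_mapping.csv
touch resources/raw/SRR00000001.fastq.gz
tail -n +2 /tmp/da_srr_mapping.csv | while IFS=, read -r da srr; do
  src="resources/raw/${srr}.fastq.gz"
  [ -f "$src" ] && mv "$src" "resources/raw/${da}.fastq.gz"
done
```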
- one more piece of data you need is the CARD database. You can download the newest version using

  ```shell
  wget https://card.mcmaster.ca/latest/data
  ```

  and then follow the instructions on the CARD website on how to get the actual database. Alternatively, you can use the version we used for this project, stored in the `localDB.tgz` archive. Just unpack it and make sure it is inside the project's directory, i.e. inside `HeteroR`:

  ```shell
  tar -xf localDB.tgz
  ```

  The corresponding step of the pipeline will find this database and use it for resistance gene identification.
- install conda/mamba and snakemake
- activate the snakemake environment:

  ```shell
  conda activate snakemake
  ```
- run the pipeline using this command:

  ```shell
  # substitute $N with the number of threads you want to use
  snakemake --snakefile workflow/assembly-annotation.smk --use-conda --cores $N
  ```

  note: add `--use-singularity` if you want to run the analysis inside a container (check the Important notes above!).
After the main pipeline has finished, you can run the following four notebooks (not necessarily all of them):

- to generate the features table: `notebooks/modelling/features.qmd` (a pre-compiled table is available in `notebooks/modelling/data/features_strain.csv`)
- for exploratory data analysis: `notebooks/modelling/EDA.qmd` (a pre-compiled HTML file is available in `notebooks/modelling/EDA.html.gz`)
- to run training and validation: `notebooks/modelling/training_and_validation.Rmd` (a pre-compiled HTML file is available in `notebooks/modelling/training_and_validation.html.gz`)
- for comparison and analysis of the models: `notebooks/modelling/models_analysis.Rmd` (a pre-compiled HTML file is available in `notebooks/modelling/models_analysis.html.gz`)
To ensure that you use the same versions of R packages as were used in these notebooks, install the `renv` package and run `renv::restore()` (see the renv documentation for details).
- place the mutant read files in `resources/raw/mutants`. Naming convention: `{parent_strain_name}_[1,2].fq.gz`. To determine the parental strains of the mutants, refer to the file `configs/parent_mutant_srr_mapping.csv`.
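Before starting the pipeline, it can help to verify that every mutant has both mates of its read pair. A minimal sketch; the two files created below are placeholders for illustration only:

```shell
# Sketch: check that each *_1.fq.gz in resources/raw/mutants has a *_2 mate.
mkdir -p resources/raw/mutants
# the two files below are placeholders for illustration
touch resources/raw/mutants/DA000001_1.fq.gz resources/raw/mutants/DA000001_2.fq.gz
missing=0
for f1 in resources/raw/mutants/*_1.fq.gz; do
  f2="${f1%_1.fq.gz}_2.fq.gz"
  [ -f "$f2" ] || { echo "missing mate for $f1"; missing=$((missing + 1)); }
done
echo "$missing mate file(s) missing"
```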
- run the pipeline with the following command:

  ```shell
  # analysis of the HR mutants
  # substitute $N with the number of threads you want to use
  snakemake --snakefile workflow/mutants.smk --use-conda --use-singularity --cores $N
  ```

  remove `--use-singularity` if you want to use only conda environments.
- place the genomes (assemblies) of the 31 reference strains in `results/assemblies_joined/`. The names of these strains are provided in `configs/strains_phylogeny.txt`. Each genome should be placed in its own directory named according to this list of strains, and each assembly file should be named `assembly.fasta` (the same way as with the 474 collection strains from pipeline (1)).
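Setting up the expected directory layout can be scripted from the strain list. A minimal sketch, using an inline two-strain list as a stand-in for `configs/strains_phylogeny.txt`:

```shell
# Sketch: create one directory per reference strain; each should then
# receive its genome as results/assemblies_joined/<strain>/assembly.fasta.
printf 'DA000001\nDA000002\n' > /tmp/strains_phylogeny.txt   # stand-in list
while read -r strain; do
  mkdir -p "results/assemblies_joined/$strain"
done < /tmp/strains_phylogeny.txt
ls results/assemblies_joined
```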
- run the following command:

  ```shell
  # substitute $N with the number of threads you want to use
  snakemake --snakefile workflow/phylogeny.smk --use-conda --use-singularity --cores $N
  ```

  remove `--use-singularity` if you want to use only conda environments.

NB: to find the NCBI accession numbers of the reference strains, refer to the file `configs/reference_strains.csv`.
You can change the reference strain names provided in the strain list to whatever suits you better.
Lists of strain names used in each of the pipelines can be found in the `configs/strains_*.txt` files.
Settings of each software tool used can be found in the `configs/config_*.yaml` files.
Software versions are specified in the yaml files located in `workflow/envs`.
The raw sequencing reads used in this project are available from NCBI's SRA under BioProjects PRJNA1165464 (474 parental strains), PRJNA1083935 and PRJNA1160527 (mutants).
The pre-compiled features table is available in `notebooks/modelling/data/features_strain.csv`.
The final models (trained LLR and GBT) are available in `notebooks/modelling/models`.