This repository contains all code and analysis needed to reproduce the genome assembly and annotation described in "Complete genome sequence of a virulent barcoded Mycobacterium tuberculosis str. Erdman commonly used for non-human primate infection studies".
NCBI GenBank assembly: GCA_044324775.1
NCBI BioSample: SAMN43777470
NCBI BioProject: PRJNA1161419
The sequencing reads (Oxford Nanopore and Illumina) used for genome assembly are available in the SRA under BioProject ID PRJNA1161419.
All tools, along with their parameters, used to produce the final Erdman assembly can be found in the scripts/ directory.
Information about individual steps can be found in scripts/README.md.
Bakta (v1.5) was used to produce an automated genome annotation for the Erdman genome.
The output of Bakta can be found in Results/Bakta_annotation.
To produce a high-quality genome annotation of the new Erdman genome assembly, the following general steps were performed:
- RATT was used for the initial liftover of H37Rv's annotated features.
- Where genomic regions could not be annotated via liftover with RATT, the automated annotations produced by Bakta were used instead.
- This was followed by manual curation of the integrated plasmid sequence.
The manual curation and matching of gene annotations to the H37Rv equivalent was done to maximize the utility of using this new Erdman genome as a reference.
All code related to merging and curating the genome annotation can be found in Analysis/Annotation/1.annotation.ipynb.
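The notebook holds the actual merging code. As a rough illustration of the general idea only (keep every feature lifted over by RATT, and fall back to Bakta where RATT left a region unannotated), here is a minimal Python sketch. The `Feature` class, the `merge_annotations` function, the simple "any coordinate overlap" rule, and the toy coordinates and gene names are all assumptions made for this example and are not taken from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical minimal feature record (1-based, inclusive coordinates)."""
    start: int
    end: int
    strand: str
    name: str
    source: str  # "RATT" or "Bakta"


def overlaps(a: Feature, b: Feature) -> bool:
    """True if the two features share at least one base."""
    return a.start <= b.end and b.start <= a.end


def merge_annotations(ratt: list[Feature], bakta: list[Feature]) -> list[Feature]:
    """Keep all RATT features; add Bakta features only where RATT has none.

    Assumption: a Bakta feature is treated as redundant if it overlaps
    any RATT feature at all. The real curation may use stricter rules.
    """
    merged = list(ratt)
    for feat in bakta:
        if not any(overlaps(feat, r) for r in ratt):
            merged.append(feat)
    return sorted(merged, key=lambda f: (f.start, f.end))


# Toy input: made-up coordinates and names, for illustration only.
ratt = [
    Feature(1, 1500, "+", "geneA", "RATT"),
    Feature(2000, 3200, "+", "geneB", "RATT"),
]
bakta = [
    Feature(1, 1500, "+", "geneA_bakta", "Bakta"),       # overlaps a RATT call, dropped
    Feature(5000, 5600, "-", "hypothetical_1", "Bakta"),  # fills a gap, kept
]

merged = merge_annotations(ratt, bakta)
for f in merged:
    print(f.start, f.end, f.strand, f.name, f.source)
```

Giving RATT calls priority in this sketch mirrors the stated goal of matching annotations to their H37Rv equivalents wherever a liftover exists.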
🚧 Check back soon 🚧
This repository is distributed under the terms of the MIT license.