mVIRs: Localisation of inducible prophages using NGS data

mVIRs: Localisation of inducible prophages using NGS data
Installation
- Installation using conda
- Manual installation
Usage
- INDEX
- OPRS
Output files
Concepts
- IPRs, OPRs and SAME
  - Algorithm
- Clipped Alignments

mVIRs is a tool that locates the integration sites of inducible prophages in bacterial genomes. It extracts information on the alignment orientation of paired-end Illumina reads that are mapped to lysogenic host reference genome. The aim is to identify DNA segments that are predicted to exist in circularized or concatenated form after induction. These segments can be length-filtered and classified by prediction tools, such as VirSorter2, VirFinder, VIBRANT or Prophage Hunter, to identify putative prophage candidates.

The tool was developed by Mirjam Zuend, Hans-Joachim Ruscheweyh and Shinichi Sunagawa. It is distributed under .

If you use mVIRs, please cite:

Zünd M, Ruscheweyh HJ, Field CM, Meyer N, Cuenca M, Hoces D, Hardt WD, Sunagawa S. High throughput sequencing provides exact genomic locations of inducible prophages and accurate phage-to-host ratios in gut microbial strains. Microbiome, 2021.

Analyses in the publication were executed using version 1.0.0.

Questions/Comments? Write a github issue.

Installation

The tools is written in python and has the following dependencies:

Python>=3.7
- pysam>=0.15.2
BWA (tested: v0.7.17-r1188)
samtools (tested: v1.9)

Installation using conda

The easiest way to install mVIRs is to use the conda package manager, which will automatically creat an environment with the correct versions of the dependencies.

# Install dependencies
$ git clone https://github.com/SushiLab/mVIRs
$ cd mVIRs   

$ conda env create -f conda_env_mvirs.yaml
$ conda activate mvirs   

#Installs the mVIRs package 
$ pip install -r requirements.txt -e .
# Add --user for installation with user permissions only


#Test
$ mvirs -h

2021-06-08 11:00:57,878 INFO: Starting mVIRs

Program: mVIRs - Localisation of inducible prophages using NGS data
Version: 1.1.0
Reference: Zuend, Ruscheweyh, et al.
High throughput sequencing provides exact genomic locations of inducible
prophages and accurate phage-to-host ratios in gut microbial strains.
Microbiome (2021). doi:10.1186/s40168-021-01033-w

Usage: mvirs <command> [options]
Command:

    oprs    align reads against reference and used clipped
            alignment positions and OPRs to extract potential
            prophages

    index   create index files for reference used in the
            mvirs oprs routine
            
    test    run mVIRs for a public dataset


mvirs.py: error: the following arguments are required: command

Manual installation

Manual installation is possible but not recommended. Install via pip after installation of dependencies:

$ git clone https://github.com/SushiLab/mVIRs
$ cd mVIRs

#Installs the package locally
$ pip install -r requirements.txt -e .
# Add --user for installation with user permissions only

#Test
$ mvirs -h

Program: mVIRs - Localisation of inducible prophages using NGS data
Version: 1.1.0
Reference: Zuend, Ruscheweyh, et al.
High throughput sequencing provides exact genomic locations of inducible
prophages and accurate phage-to-host ratios in gut microbial strains.
Microbiome (2021). doi:10.1186/s40168-021-01033-w

Usage: mvirs <command> [options]
Command:

    oprs    align reads against reference and used clipped
            alignment positions and OPRs to extract potential
            prophages

    index   create index files for reference used in the
            mvirs oprs routine


mvirs.py: error: the following arguments are required: command

Usage

The mVIRs toolkit includes 3 functions, oprs, index and test. The index commands takes a reference file as input and builds the index that is needed for the execution of the oprs command. The oprs command aligns paired-end reads against the reference database to detect so called outward orientated paired-end reads (OPRs) and uses soft-clipped alignments (clipped reads) to extract regions from the database that are potential prophages. The test function executes mVIRs on a test dataset that is downloaded as part of this function.

INDEX

This step takes the reference sequence file as input and builds an bwa index using the bwa index command. This command has to be executed before running the oprs command

Program: mVIRs - Localisation of inducible prophages using NGS data
Version: 1.1.0
Reference: Zuend, Ruscheweyh, et al.
High throughput sequencing provides exact genomic locations of inducible
prophages and accurate phage-to-host ratios in gut microbial strains.
Microbio
8000
me (2021). doi:10.1186/s40168-021-01033-w

Usage: mvirs index [options]

    Input:
        -f  FILE   Reference FastA file. Can be gzipped. [Required]

# The index subcommand can then be executed with the following command:

$ mvirs index -f reference.fasta

OPRS

This step needs two sequence read files and a reference database (produced from mvirs index). It aligns the reads against the database and uses the alignments information to extract potential prophage sequences from the reference genome using coverage information from OPRS and Clipped reads.

$ mvirs oprs

Program: mVIRs - Localisation of inducible prophages using NGS data
Version: 1.1.0
Reference: Zuend, Ruscheweyh, et al.
High throughput sequencing provides exact genomic locations of inducible
prophages and accurate phage-to-host ratios in gut microbial strains.
Microbiome (2021). doi:10.1186/s40168-021-01033-w

Usage: mvirs oprs [options]

    Input:
        -f  FILE   Forward reads file. Can be gzipped. [Required]
        -r  FILE   Reverse reads file. Can be gzipped. [Required]
        -db FILE   BWA reference. Has to be created upfront. [Required]

    Output:
        -o  PATH   Prefix for output file. [Required]

    Options:
        -t  INT    Number of threads. [1] 
	-ml INT    Minimum length to extract. [4000]
        -ML INT    Maximum length to extract. [800000]
        -m         Allow full scaffolds to be reported 
	           (When OPRs and Clipped reads are found at start and end of scaffolds)

# Example
$ mvirs oprs -f reads.1.fq.gz -r reads.2.fq.gz -db reference.fasta -o mvirs.output


# Will produce the following files (see below for explanation of the files)
# mvirs.output.bam --> Alignments
# mvirs.output.oprs --> The raw OPRS positions
# mvirs.output.clipped --> The raw clipped alignment positions
# mvirs.output.fasta --> The potential prophage regions

TEST

$ mvirs.py test

2021-09-20 08:44:31,283 INFO: Starting mVIRs
Program: mVIRs - Localisation of inducible prophages using NGS data
Version: 1.1.0
Reference: Zuend, Ruscheweyh, et al.
High throughput sequencing provides exact genomic locations of inducible
prophages and accurate phage-to-host ratios in gut microbial strains.
Microbiome (2021). doi:10.1186/s40168-021-01033-w
Usage: mvirs test [options]

    Input:
        -o  PATH   Output folder [Required]

# Example
$ mvirs test ~/mVIRs_test/
# Will produce the following files (see below for explanation of the files)
# ERR4552622_100k_mVIRs.bam --> Alignments
# ERR4552622_100k_mVIRs.oprs --> The raw OPRS positions
# ERR4552622_100k_mVIRs.clipped --> The raw clipped alignment positions
# ERR4552622_100k_mVIRs.fasta --> The potential prophage regions

Output files

mvirs oprs produces 4 output files (mvirs.output.bam. mvirs.output.oprs, mvirs.output.clipped and mvirs.output.fasta) of which we will explain the latter 3.

mvirs.output.oprs

The mvirs.output.oprs file lists the inserts of paired-end reads that align either with an unusual orientation (e.g. OPR or SAME) or have an unexpected large insert size (IPR) when compared to the estimated insert size.

The columns of the file are:

MIN_REASONABLE_INSERTSIZE: The low boundary for regular insert sizes
MAX_REASONABLE_INSERTSIZE: The high boundary for regular insert sizes
ESTIMATED_INSERTSIZE: ??
READNAME: The name of the insert
REFERENCE: The name of the reference
INSERT_SIZE: The insert size of this insert
R1_ORIENTATION: The orientation of the r1 read on the reference
R2_ORIENTATION: The orientation of the r2 read on the reference
BWA_SCORE: The additive score reported by bwa for this insert
R1_START: The lowest coordinate on the reference where the r1 read aligns
R2_START: The lowest coordinate on the reference where the r2 read aligns
R1_ALNLENGTH: The length of the alignment of the r1 read on the reference
R2_ALNLENGTH: The length of the alignment of the r2 read on the reference
READ_ORIENTATION: The orientation of the both reads to each other. Can either be IPR, OPR or SAME

An example output is below:

#MIN_REASONABLE_INSERTSIZE=0
#MAX_REASONABLE_INSERTSIZE=1628
#ESTIMATED_INSERTSIZE=

#READNAME	REFERENCE	INSERT_SIZE	R1_ORIENTATION	R2_ORIENTATION	BWA_SCORE	R1_START	R2_START	R1_ALNLENGTH	R2_ALNLENGTH	INSERT_ORIENTATION
K00206:180:H2CJWBBXY:8:1107:6644:49230	SalmonellaLT2	41477	forward	reverse	297	1255437	1214111	151	147	OPR
K00206:180:H2CJWBBXY:8:1107:7182:12181	SalmonellaLT2	41392	forward	reverse	288	1255606	1214365	151	143	OPR
K00206:180:H2CJWBBXY:8:1107:7436:46873	SalmonellaLT2	41449	reverse	forward	302	1214126	1255424	151	151	OPR
K00206:180:H2CJWBBXY:8:1107:8582:43304	SalmonellaLT2	1351429	reverse	reverse	225	4216570	2865291	150	80	SAME
K00206:180:H2CJWBBXY:8:1107:9404:2176	SalmonellaLT2	41222	forward	reverse	291	1255124	1214053	151	145	OPR
K00206:180:H2CJWBBXY:8:1107:10470:14959	SalmonellaLT2	41453	reverse	forward	302	1214268	1255570	151	151	OPR
K00206:180:H2CJWBBXY:8:1107:10724:29958	SalmonellaLT2	140	reverse	forward	201	475072	475135	140	76	OPR

mvirs.output.clipped

The mvirs.output.clipped file contains name, orientation and position of aligned reads that were clipped (An alignment is clipped when not the entire read can be mapped consecutively).

The columns of the file are:

Insert: Name of the insert
READORIENTATION: Orientation of thr read (R1 or R2)
HARD/SOFTCLIP: Reported clip type by BWA (Soft --> longer part of the alignment. Hard --> shorter part of the alignment)
DIRECTION: Direction of the alignment in respect to the reference sequence
POSITON: Leftmost aligned based on the reference sequence
SCAFFOLD: Name of the scaffold/genomic region of the reference sequence

An example output is below:

#INSERT	READORIENTATION	HARD/SOFTCLIP	DIRECTION	POSITION	SCAFFOLD
K00206:180:H2CJWBBXY:8:1101:1336:1982	R2	S	->	252072	SalmonellaLT2
K00206:180:H2CJWBBXY:8:1101:1418:9895	R2	S	<-	1946641	SalmonellaLT2
K00206:180:H2CJWBBXY:8:1101:1468:42231	R2	S	->	1492581	SalmonellaLT2
K00206:180:H2CJWBBXY:8:1101:1864:3670	R2	S	<-	2886283	SalmonellaLT2
K00206:180:H2CJWBBXY:8:1101:1915:46346	R1	S	<-	1255756	SalmonellaLT2
K00206:180:H2CJWBBXY:8:1101:1915:46346	R1	H	->	1213986	SalmonellaLT2
K00206:180:H2CJWBBXY:8:1101:1996:25738	R2	S	<-	4147832	SalmonellaLT2
K00206:180:H2CJWBBXY:8:1101:2037:2457	R2	S	->	465110	SalmonellaLT2
K00206:180:H2CJWBBXY:8:1101:2087:10422	R1	S	<-	4769629	SalmonellaLT2

mvirs.output.fasta

The mvirs.output.fasta is a fasta file with the potential prophage regions that were extracted from the reference. The header of the fasta includes information on source scaffold, start and end coordinates of the potential prophage, number of supporting OPRS, the number of supporting clipped alignments and the fraction of the scaffold that is covered by the extracted region.

>SalmonellaLT2:1213986-1255756	ORPs=3868-HSs=1473-SF=0.852597
ATTCGTAATGCGAAGGTCGTAGGTTCGACTCCTATTATCGGCACCAGTTAAATCAAATACTTAC...
>SalmonellaLT2:1849457-1892188	ORPs=743-HSs=338-SF=0.872212
TCCTTTCAGTGATTGCATAACCACTTAACATCTTGTTTTATCTAAATAAAATTAAGCATGTTAT...
>SalmonellaLT2_plasmid:0-93927  OPRs=15-HSs=10-SF=100.000000
ACAGCGGCCGCCGAATTTCATACGCAGTCTTTCCCGCTGTTTTCTGGTCAGCGCCATAACATCT...

# SalmonellaLT2:1213986-1255756 --> Scaffold:START-STOP
# ORPs=3868 --> Number of OPRs = 3868
# HSs=1473 --> Number of clipped alignments at the location of the OPR = 1473
# SF=0.852597 --> 0.85% of the scaffold is covered by the extracted region

Concepts

IPRs, OPRs and SAME

A paired-end read can align in the following orientations:

# IPRs --> When the insert orientation matches the reference sequence

REFERENCE ---------------------------------------
R1              -------->
R2                        <--------

# OPRs --> When the insert orientation doesnt match the reference genome, e.g. circularized virus particles

REFERENCE ---------------------------------------
R1              <--------
R2                        -------->


# SAME -> When the insert orientation doesnt match the reference genome, e.g. rearrangements

REFERENCE ---------------------------------------
R1              -------->
R2                        -------->

This tool reports IPRs with unreasonable insert sizes and OPRs.

Algorithm

The algorithm works the following:

Reads were grouped to inserts by name. Upper and lower maxima for insert sizes of paired inserts are estimated using the mean insert sizes and +/- 7 StDev's of uniquely mapping inward-oriented paired-end reads (IPRs).
OPRs are detected the following: For each insert

Find the best scoring alignment pairs within 3% of the best alignment score.
Report the insert as OPR if there is no IPR with reasonable insert size within the 3% cutoff and if the OPR is the best scoring alignment.

Clipped Alignments

The alignment of a read is clipped if it can't align against the reference in full length. Two reasons why alignments can be clipped are:

A circular chromosome is linearized. Reads that align to the beginning of the linearized chromosome (here named reference) will also align at the end. A single full length alignment is not possible. E.g. The first three bases of the read align at the end of the genome. The last 3 bases at the beginning
```
REFERENCE ---------------------------------------
READ      ---                                 ---
          456                                 123  
```
Or assuming a phage is integrated in the chromosome but the read comes from a the same, but circularized phage.
```
# Phage integrated in reference chromosome denoted as P
REFERENCE ----------------PPPPPPPPPPP------------
READ                      ---     ---
                          456     123
```
The reference contains an element (e.g an integrated phage) that was not part of the genome from which the read was sequenced. This can happen when genomes from the same bacterial species contain or not contain an integrated phage. E.g here the bases denoted as P are from a phage that is absent in the read.
```
REFERENCE ----------------PPPPPPPP---------------
READ                   ---        ---
                       123        456
```

Regions where an accumulation of both cases and OPRs are detected are reported as potential phages in the output fasta file.

Algorithm

OPRs have start and end positions but are not precise enough to detect exact excision positions. Clipped alignments are precise but often miss part of the alignment due to multimappers or the restriction on minimal alignment length. Therefore a region must have at least 1 OPR and at least 1 clipped alignment to be considered as a potential site for an integrated phage. Further, potential sites must have a sum of 5 or more OPRs and clipped alignments total.

Sites are also filtered by length, with a minimum requirement of 4kb and a maximum of 800kb. These limits can be modified with the -ml (minimum) and -ML (maximum) parameters on the command line.

If the -m flag is set, sites that cover entire scaffolds will be reported, otherwise they will be discounted.

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
mVIRs		mVIRs
pics		pics
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
conda_env_mvirs.yaml		conda_env_mvirs.yaml
requirements.txt		requirements.txt
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

mVIRs: Localisation of inducible prophages using NGS data

Installation

Installation using conda

Manual installation

Usage

INDEX

OPRS

TEST

Output files

mvirs.output.oprs

mvirs.output.clipped

mvirs.output.fasta

Concepts

IPRs, OPRs and SAME

Algorithm

Clipped Alignments

Algorithm

About

Uh oh!

Releases 3

Packages

Uh oh!

Contributors 5

Uh oh!

Languages

License

SushiLab/mVIRs

Folders and files

Latest commit

History

Repository files navigation

mVIRs: Localisation of inducible prophages using NGS data

Installation

Installation using conda

Manual installation

Usage

INDEX

OPRS

TEST

Output files

mvirs.output.oprs

mvirs.output.clipped

mvirs.output.fasta

Concepts

IPRs, OPRs and SAME

Algorithm

Clipped Alignments

Algorithm

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Uh oh!

Contributors 5

Uh oh!

Languages

Packages