vitorpavinato/PypeAmplicon: Pipeline to automate the processing of amplicon data.
PypeAmplicon v1.0: Python pipeline for analysis of amplicon data

Authors:

Saranga Wijeratne & Vitor A. C. Pavinato

CITATION

Wijeratne S & Pavinato VAC. (2018, November 17). PypeAmplicon: Python pipeline for analysis of amplicon data (Version v1.0). Zenodo. doi: http://doi.org/10.5281/zenodo.1490421

Pavinato VAC, Wijeratne S, Spacht D, Denlinger DL, Meulia T, Michel AP. (2019, September 19). Leveraging targeted sequencing for non-model species: a step-by-step guide to obtain a reduced SNP set and a pipeline to automate data processing in the Antarctic Midge, Belgica antarctica. bioRxiv, 772384. doi: https://doi.org/10.1101/772384

INSTALLATION

To run the pipeline, install the software listed below and the required Python packages (see the main script). For the third-party software, follow the installation instructions each tool provides. Install the Python packages with pip or Anaconda, depending on how you manage your Python installation.

Third-party software:

USAGE

python pypeamplicon.py pypeamplicon.cfg

CONFIGURATION FILE

To run the pipeline, you must specify a configuration file. This file contains the paths and other information the pipeline needs. Each configuration section is described below.

[SOFTWARES]

Install the required software and specify the path to each program.

[REF]

You should specify the path for the file containing the FASTA sequence you are using as a reference sequence. It can be either the reference genome/scaffold OR the fragment homologous to the amplicon.

[FQDIR]

You should specify where the .fastq files were stored.

[WORKINGDIR]

You should specify your working directory. We suggest the following organization:

  • <Project_folder> as the working directory;
    • <reference_file> Where you store your reference file;
    • <raw_files> Where you store your raw fastq files;

[COMMANSUFFIX]

In this section you can specify the type of sequencing data you have:

  • PE: for paired-end data;
  • SE: for single-end data;

For PE data, provide the suffix of the first-read files (R1) and of the second-read files (R2). For this to work, your raw fastq files must have the suffix at the end of the file name, right before .fastq.
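The suffix rule above can be illustrated with a small sketch. This is not the pipeline's own code: the function name `pair_reads` and the example suffixes `_R1` and `_R2` are assumptions chosen for illustration, and both suffixes are assumed to be non-empty.

```python
from pathlib import Path


def pair_reads(filenames, r1_suffix, r2_suffix, ext=".fastq"):
    """Group file names into (R1, R2) pairs keyed by the shared sample prefix.

    A file counts as R1 or R2 only if the suffix sits right before the extension.
    """
    r1, r2 = {}, {}
    for name in filenames:
        base = Path(name).name
        if not base.endswith(ext):
            continue  # not a fastq file
        stem = base[: -len(ext)]
        if stem.endswith(r1_suffix):
            r1[stem[: -len(r1_suffix)]] = name
        elif stem.endswith(r2_suffix):
            r2[stem[: -len(r2_suffix)]] = name
    # Keep only samples that have both mates.
    return {sample: (r1[sample], r2[sample]) for sample in sorted(r1) if sample in r2}
```

Calling `pair_reads(names, "_R1", "_R2")` on a directory listing is a quick way to spot samples whose file names do not follow the suffix convention before starting a run.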

[CLUSTERFILTER]

The clustering filter section controls two parameters that define how your raw data are filtered:

  1. filtercutoff: the maximum percentage of divergence allowed between reads within a cluster of reads;
  2. idcutoff: the minimum proportion of similarities that clusters of reads should have to be considered the same cluster;
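Before running the pipeline, it can help to check that the configuration file contains every section described above. The sketch below assumes the file uses INI-style `[SECTION]` headers that Python's `configparser` can read; the keys inside each section are not documented here, so only section names are checked. The function name `missing_sections` is hypothetical.

```python
import configparser

# Section names as listed in this README.
REQUIRED_SECTIONS = [
    "SOFTWARES",
    "REF",
    "FQDIR",
    "WORKINGDIR",
    "COMMANSUFFIX",
    "CLUSTERFILTER",
]


def missing_sections(cfg_text):
    """Return the required sections that are absent from the configuration text."""
    parser = configparser.ConfigParser(allow_no_value=True)
    parser.read_string(cfg_text)
    # Section names are case-sensitive in configparser.
    return [name for name in REQUIRED_SECTIONS if not parser.has_section(name)]
```

For example, `missing_sections(open("pypeamplicon.cfg").read())` returns an empty list when all six sections are present.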

POSTPROCESSING SCRIPTS

In the final stage of the pipeline, variant calling is performed with freebayes. If you would like more control over this stage, we provide a shell script that runs freebayes. The parameters set in this script were tested on three independent datasets. You can modify it to fit your needs.

Script freebayes-pypeamplicon.sh

sh freebayes-pypeamplicon.sh <input.bam> <output.vcf> <populations-file>

The populations file is a tab-delimited file containing in each row:

<sample_id> tab <pop_id>
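A malformed populations file (for example, spaces instead of a tab) is easy to miss. The following sketch checks the two-column layout described above before the file is passed to the script. The function name `read_populations` is hypothetical and not part of the pipeline.

```python
def read_populations(lines):
    """Parse '<sample_id><TAB><pop_id>' rows into a dict, rejecting malformed rows."""
    pops = {}
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2 or not all(field.strip() for field in fields):
            raise ValueError(
                f"line {number}: expected <sample_id><TAB><pop_id>, got {line!r}"
            )
        sample_id, pop_id = (field.strip() for field in fields)
        pops[sample_id] = pop_id
    return pops
```

It accepts any iterable of lines, so an open file handle can be passed directly.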

If you are working with amplicon sequencing and have a list of targeted SNPs (already tested in other experiments), you are probably interested in processing the raw VCF from the amplicon sequencing experiment to remove the non-targeted SNPs. Two other shell scripts allow you to:

  1. Keep only the targeted SNPs in the filtered .vcf file;
  2. Keep only the non-targeted SNPs in another .vcf file.

To keep only the targeted SNPs

sh vcfintersect-targets.sh <input.vcf> <output.vcf>

To keep only the non-targeted SNPs

sh vcfintersect-non-targets.sh <input.vcf> <output.vcf>

In both scripts you need to specify a BED file that contains the position of the targeted SNPs.
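If your targeted SNPs are listed as 1-based positions (as in a VCF), they need to be converted to BED intervals, which are 0-based and half-open. The sketch below shows that conversion for single-base SNPs. The function name `snps_to_bed` and the scaffold name in the usage note are illustrative assumptions, and how the BED file is supplied to the two scripts is not covered here.

```python
def snps_to_bed(snps):
    """Convert (chrom, 1-based position) pairs into three-column BED text.

    A SNP at 1-based position p becomes the interval [p - 1, p).
    """
    lines = []
    for chrom, pos in sorted(snps):
        if pos < 1:
            raise ValueError(f"position must be >= 1, got {pos} on {chrom}")
        lines.append(f"{chrom}\t{pos - 1}\t{pos}\n")
    return "".join(lines)
```

For instance, `snps_to_bed([("scaffold_1", 100)])` yields a single line with start 99 and end 100.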

FUTURE IMPLEMENTATIONS

  • include an assembly step to handle amplicons from the same region with overlapping ends (without a reference);
  • extend the pipeline to cases where the reference sequence is not available;
  • a step to infer ploidy levels based on the number of clusters at each targeted locus;
  • a likelihood-based method and a simple count method to infer the genotype at variant positions;
  • extend the genotyping of SNPs to the most likely haplotypes for each locus/sample;
  • an installation step that installs all dependencies;
  • add argument "sam=1.3" to fix CIGAR incompatibilities.
