8000 GitHub - Sentieon/sentieon-dnascope-ml: Sentieon DNAscope + Machine Learning Model
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Sentieon/sentieon-dnascope-ml

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 

Repository files navigation

Sentieon

DNAscope Machine Learning Model

A machine learning model for accurate and efficient germline small-variants detection

Sentieon DNAscope combines the robust and well-established preprocessing and assembly mathematics of the GATK’s HaplotypeCaller with a machine-learned genotyping model, achieving superb SNP and insertion/deletion accuracy as compared to state-of-the-art tools, while using much reduced computational cost.

Table of Contents

From Sentieon software version 201808.01 onwards, DNAscope allows you to use a model to perform variant calling with higher accuracy by improving the candidate detection and filtering.

Sentieon can provide you with a model trained using a subset of the data from the GIAB truth-set found in https://github.com/genome-in-a-bottle. The model was created by processing samples HG001 and HG005 through a pipeline consisting of Sentieon BWA-mem alignment and Sentieon deduplication, and using the variant calling results to calibrate a model to fit the truth-set. Sentieon also provides DNAscope model training tool for you to create your model based on your own data.

  • High Accuracy
  • Rapid Turnaround Time
  • Cost Reduction
  • Easy Deployment
  • Customizable Model

pipeline

Running DNAscope on-premises

FASTQ -> BAM -> VCF

  1. Sentieon license

    Update Sentieon packages location and license file in dnascope.sh.

    export SENTIEON_INSTALL_DIR=/home/release/sentieon-genomics-201808.06 #your Sentieon package location
    export SENTIEON_LICENSE=/home/bundle/sentieon.lic #your license file location

    If you do not have a Sentieon license/package yet, please feel free to request a free trial at sentieon.com.

  2. Location of input files

    Before running, you need to set the following variables in dnascope.sh.

    • fastq_folder: fastq file(s) folder

    • fastq_1: fastq file name

    • fastq_2: second fastq file, if using Illumina paired data

    • model: DNAscope model file

    • PCRFREE: boolean to indicate whether the sample is PCR Free or not. Set to true for PCR Free samples.

    • fasta: reference file

    • dbsnp: dbSNP file

  3. Running the pipeline

    sh dnascope.sh

If you have any further question, please refer to Sentieon's Appnotes for DNAscope Machine Learning Model and DNAseq pipeline example script in the manual.

Here we demonstrate DNAscope's performance on PrecisionFDA Truth Challenge HG002 sample.

Software source

Sentieon packages/models are stored on AWS. You could get Sentieon tools by running:

 SENTIEON_VERSION="version-you-want" #For example 201808.07, Find released versions here https://support.sentieon.com/manual/appendix/releasenotes/
 INSTALL_DIR="your-install-dir"
 wget -nv -O - "https://s3.amazonaws.com/sentieon-release/software/sentieon-genomics-${SENTIEON_VERSION}.tar.gz" | tar -zxf - -C ${INSTALL_DIR} 

Sentieon Models:

https://s3.amazonaws.com/sentieon-release/other/SentieonDNAscopeModel1.0.model

Data source

  1. Reference Genome: hs37d5.fa.
  2. Test Data: NIST HG002 from PrecisionFDA Truth Challenge.
  3. Model File: Sentieon DNAscope Machine Learning Model version 0.5.
  4. Truth VCF and BED file for evaluation: NIST GIAB project, version v3.3.2. (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/NISTv3.3.2/GRCh37/)

You could access these data from our google cloud bucket:

File Location
Reference Genome gs://sentieon-test/pipeline_test/reference/
PrecisionFDA Truth Challenge HG002 FASTQs gs://sentieon-dnascope-model/data/
DNAscope Model gs://sentieon-dnascope-model/models/SentieonModelBeta0.5.model
Truth VCF and Bed files gs://sentieon-dnascope-model/truth/

Variant Calling Performance

We use RTG's vcfeval + hap.py (https://github.com/Illumina/hap.py) for variants evaluation, same comparison methodology as used in precisionFDA Truth Challenge.

HAPPY="/opt/hap.py/bin/hap.py"
OUTPREFIX="happy_eval"
$HAPPY truth.vcf query.vcf -f truth.bed -o $OUTPREFIX -r hs37d5.fa --engine=vcfeval --engine-vcfeval-template hs37d5.sdf

You could find DNAscope output as well as the hap.py evaluation results published on Google Storage Buckets gs://sentieon-dnascope-model/ under output/ and happy_eval/ directories.

Type TP FN FP Recall Precision F1_score PrecisionFDA Truth Challenge Winning Fscore*
SNP 3046378 1459 700 0.999521 0.99977 99.9646% 99.9587%
INDEL 463754 1010 685 0.997827 0.998585 99.8206% 99.4009%

*PrecisionFDA Truth Challege Result taken from https://precision.fda.gov/challenges/truth/results

If you are interested in other Sentieon Products, please visit www.sentieon.com for more information.

0