
afproject

Codebase that benchmarks alignment-free sequence comparison methods. The code is used in the publicly available web service AFproject (http://afproject.org).

This repository contains Python scripts to benchmark alignment-free methods offline.

Requirements

The following software tools are required:

  1. fneighbor from the EMBOSS package
  2. tqdist

The following Python packages are required:

  1. numpy for general calculations
  2. scikit-learn for calculating ROC curves and AUC
  3. ete3 for calculating the Robinson-Foulds distance

Usage

Genome-based phylogeny & horizontal gene transfer

To calculate accuracy on the fish_mito dataset from the genome-based phylogeny category for an input file in TSV format, just run:

python benchmark_genome.py --input example_input/genome/std/assembled-fish_mito_random.tsv --format tsv --reference fish_mito

You will get information about the normalized Robinson-Foulds (nRF) distance and the normalized Quartet Distance (nQD):

nRF:     1.00
RF:      44.0
max RF:  44.0
nQD:     0.6897
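The nRF value above is simply RF divided by the maximum possible RF (an nRF of 1.00 means the trees share no non-trivial bipartitions). A minimal sketch of the idea, with a toy 5-leaf example; the `nrf` helper and the split encoding are illustrative and not part of `benchmark_genome.py`:

```python
def nrf(splits_a, splits_b):
    """Normalized Robinson-Foulds distance: size of the symmetric
    difference of the two trees' non-trivial bipartitions, divided
    by the maximum possible value."""
    rf = len(splits_a ^ splits_b)
    max_rf = len(splits_a) + len(splits_b)
    return rf, max_rf, rf / max_rf

# Each unrooted tree is summarized by its non-trivial bipartitions,
# written here as the smaller side of each split.
t_ref  = {frozenset("AB"), frozenset("CD")}   # ((A,B),(C,D),E)
t_test = {frozenset("AC"), frozenset("BD")}   # ((A,C),(B,D),E)

rf, max_rf, nrf_val = nrf(t_ref, t_test)
# The two trees disagree on every split, so nrf_val is 1.0,
# analogous to the random-tree example above.
```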

You can provide your input files in TSV, PHYLIP, or Newick format:

python benchmark_genome.py --input example_input/genome/std/assembled-fish_mito_random.phy --format phylip --reference fish_mito
python benchmark_genome.py --input example_input/genome/std/assembled-fish_mito_random.newick --format newick --reference fish_mito

Regulatory elements

To calculate performance in the CRM dataset for your input PHYLIP file, just run:

python benchmark_genreg_crm.py --input example_input/genreg/crm_random.phy --format phylip

You will get information about your method's performance across all 7 tissues:

Tissue	k	n	percent
FB	151	300	50.3
FP	127	253	50.2
FT	17	36	47.2
HM	149	300	49.7
HL	17	36	47.2
HH	72	136	52.9
FE	68	136	50.0

Weighted average:	50.21
Standard deviat.:	1.83
Average:         	49.65
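The summary statistics can be reproduced from the table. A sketch, assuming `k` is the number of correctly matched pairs out of `n` pairs per tissue, the weighted average weights each tissue by `n`, and the standard deviation is a population standard deviation over the per-tissue percentages (these assumptions reproduce the numbers printed above):

```python
from statistics import pstdev

# (tissue, k = correctly matched pairs, n = total pairs), from the table above
rows = [("FB", 151, 300), ("FP", 127, 253), ("FT", 17, 36),
        ("HM", 149, 300), ("HL", 17, 36), ("HH", 72, 136), ("FE", 68, 136)]

percents = [100 * k / n for _, k, n in rows]

# Weighting each tissue by its pair count n reduces to total k over total n.
weighted_avg = 100 * sum(k for _, k, _ in rows) / sum(n for _, _, n in rows)
plain_avg = sum(percents) / len(percents)
sd = pstdev(percents)  # population standard deviation
```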

Protein sequence classification

To calculate performance in the protein classification category, just run:

python benchmark_protein.py -i example_input/protein/low-ident/low-ident_random.tsv -f tsv -r low-ident

You will get information about your method's performance across 4 structural levels:

AUC	class	     0.500
AUC	fold	     0.499
AUC	superfamily	 0.504
AUC	family	     0.502
Mean AUC:        0.501
Stdev AUC:       0.002

For protein sequences with high identity:

python benchmark_protein.py -i example_input/protein/high-ident/high-ident_random.tsv -f tsv -r high-ident

Gene Tree Inference

To calculate performance in the gene tree category, just run:

python benchmark_genetree.py -i example_input/genetree/swisstree_random.tsv -f tsv

You will get your method's nRF distance for each SwissTree reference gene family:

ST001	nRF	0.97
ST002	nRF	1.0
ST003	nRF	1.0
ST004	nRF	1.0
ST005	nRF	0.96
ST007	nRF	1.0
ST008	nRF	1.0
ST009	nRF	1.0
ST010	nRF	1.0
ST011	nRF	0.99
ST012	nRF	1.0
MEAN nRF:   0.993
STDEV nRF:  0.0124
