afproject

Codebase for benchmarking alignment-free sequence comparison methods. The code is used by the publicly available AFproject web service (http://afproject.org).

This repository contains Python scripts for benchmarking alignment-free methods offline.

Requirements

The following software tools are required:

fneighbor from the EMBOSS package
tqdist

The following Python packages are required:

numpy for general calculations
scikit-learn for calculating ROC curves and AUC
ete3 for calculating Robinson-Foulds distance
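
Before running the benchmarks, it can help to confirm that the requirements listed above are available. The following is a minimal sketch, not part of this repository: the helper name `missing_requirements` is hypothetical, and `quartet_dist` is only an assumption about the name of the tqdist executable on your system (adjust it if your installation differs). The module names `numpy`, `sklearn`, and `ete3` are the import names of the packages listed above.

```python
import importlib.util
import shutil

# External programs expected on PATH.
# "quartet_dist" is an ASSUMED executable name for tqdist; change it if needed.
REQUIRED_TOOLS = ["fneighbor", "quartet_dist"]

# Import names of the required Python packages (scikit-learn imports as "sklearn").
REQUIRED_MODULES = ["numpy", "sklearn", "ete3"]


def missing_requirements(tools=REQUIRED_TOOLS, modules=REQUIRED_MODULES):
    """Return (missing_tools, missing_modules) as two lists of names."""
    missing_tools = [name for name in tools if shutil.which(name) is None]
    missing_modules = [
        name for name in modules if importlib.util.find_spec(name) is None
    ]
    return missing_tools, missing_modules


if __name__ == "__main__":
    tools, modules = missing_requirements()
    print("Missing tools:  ", ", ".join(tools) or "none")
    print("Missing modules:", ", ".join(modules) or "none")
```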

Usage

Genome-based phylogeny & horizontal gene transfer

To calculate accuracy on the fish_mito dataset from the genome-based phylogeny category for an input file in TSV format, run:

python benchmark_genome.py --input example_input/genome/std/assembled-fish_mito_random.tsv --format tsv --reference fish_mito

The script reports the normalized Robinson-Foulds (nRF) distance and the normalized Quartet Distance (nQD):

nRF:     1.00
RF:      44.0
max RF:  44.0
nQD:     0.6897
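
To illustrate how RF, max RF, and nRF relate, here is a minimal pure-Python sketch. It is not the code used by `benchmark_genome.py` (which relies on ete3); the functions `bipartitions` and `rf_distance` and the nested-tuple tree representation are hypothetical and exist only for this example. The sketch uses the usual definition: RF is the number of non-trivial leaf splits found in exactly one of the two trees, and, as one common normalization, max RF is the total number of non-trivial splits in both trees.

```python
def bipartitions(tree):
    """Return the non-trivial leaf splits of a tree given as nested tuples of leaf names."""

    def leaf_set(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaf_set(child) for child in node))

    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # fixes which side of each split is stored
    splits = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(walk(child) for child in node))
        # Keep only splits with at least two leaves on each side.
        if 1 < len(clade) < len(all_leaves) - 1:
            splits.add(clade if anchor not in clade else all_leaves - clade)
        return clade

    walk(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """Return (RF, max RF, nRF) for two trees on the same leaf set."""
    splits_a, splits_b = bipartitions(tree_a), bipartitions(tree_b)
    rf = len(splits_a ^ splits_b)           # splits present in only one tree
    max_rf = len(splits_a) + len(splits_b)  # no split shared at all
    nrf = rf / max_rf if max_rf else 0.0
    return rf, max_rf, nrf


# Toy example with five taxa (not taken from the fish_mito dataset).
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
rf, max_rf, nrf = rf_distance(t1, t2)
print(f"RF: {rf}  max RF: {max_rf}  nRF: {nrf:.2f}")  # RF: 2  max RF: 4  nRF: 0.50
```

For two fully resolved unrooted trees with n leaves, each tree has n - 3 non-trivial splits, so max RF is 2(n - 3). An nRF of 1.00, as in the output above, means the two trees share no non-trivial split.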

Input files can be provided in TSV, PHYLIP, or Newick format:

python benchmark_genome.py --input example_input/genome/std/assembled-fish_mito_random.phy --format phylip --reference fish_mito

python benchmark_genome.py --input example_input/genome/std/assembled-fish_mito_random.newick --format newick --reference fish_mito

Regulatory elements

To calculate performance on the CRM dataset for your input PHYLIP file, run:

python benchmark_genreg_crm.py --input example_input/genreg/crm_random.phy --format phylip

You will get information about your method's performance across all 7 tissues:

Tissue	k	n	percent
FB	151	300	50.3
FP	127	253	50.2
FT	17	36	47.2
HM	149	300	49.7
HL	17	36	47.2
HH	72	136	52.9
FE	68	136	50.0

Weighted average:	50.21
Standard deviat.:	1.83
Average:         	49.65
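
The three summary lines can be reproduced from the table itself. The sketch below is an illustration rather than the code in `benchmark_genreg_crm.py`; the function name `summarize` is hypothetical. It assumes that the percent column is 100 * k / n, that the weighted average pools k and n over all tissues, and that the standard deviation is the population standard deviation of the per-tissue percentages. These assumptions are inferred from the numbers above, which they match.

```python
from statistics import mean, pstdev

# (tissue, k, n) rows copied from the example output above.
ROWS = [
    ("FB", 151, 300),
    ("FP", 127, 253),
    ("FT", 17, 36),
    ("HM", 149, 300),
    ("HL", 17, 36),
    ("HH", 72, 136),
    ("FE", 68, 136),
]


def summarize(rows):
    """Return (weighted average, standard deviation, plain average) in percent."""
    percents = [100 * k / n for _, k, n in rows]
    total_k = sum(k for _, k, _ in rows)
    total_n = sum(n for _, _, n in rows)
    weighted = 100 * total_k / total_n  # tissues weighted by their n
    return weighted, pstdev(percents), mean(percents)


weighted, stdev, average = summarize(ROWS)
print(f"Weighted average:\t{weighted:.2f}")  # 50.21
print(f"Standard deviat.:\t{stdev:.2f}")     # 1.83
print(f"Average:         \t{average:.2f}")   # 49.65
```

The weighted average differs from the plain average because tissues with larger n (FB, FP, HM) contribute more to the pooled ratio.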

Protein sequence classification

To calculate performance in the protein classification category, just run:

python benchmark_protein.py -i example_input/protein/low-ident/low-ident_random.tsv -f tsv -r low-ident

You will get information about your method's performance across 4 structural levels:

AUC	class	     0.500
AUC	fold	     0.499
AUC	superfamily	 0.504
AUC	family	     0.502
Mean AUC:        0.501
Stdev AUC:       0.002
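
As a reading aid for the AUC values, the sketch below computes an AUC directly from pairwise distances. It is an assumption-laden illustration, not the repository's implementation (which uses scikit-learn): the function name `auc_from_distances` is hypothetical, and the sketch assumes that a pair of sequences counts as a positive when both belong to the same group at a given structural level, with smaller distances expected for positives. Under that assumption, the AUC equals the fraction of (positive, negative) pair combinations in which the positive pair has the smaller distance, with ties counted as one half.

```python
def auc_from_distances(distances, same_group):
    """AUC for the rule 'smaller distance => same group'.

    distances:  one distance per sequence pair
    same_group: one boolean per sequence pair (True if both share the group)
    """
    positives = [d for d, s in zip(distances, same_group) if s]
    negatives = [d for d, s in zip(distances, same_group) if not s]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative pair")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p < n:
                wins += 1.0   # positive pair correctly ranked as closer
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy data: perfect separation, then a ranking no better than chance.
print(auc_from_distances([0.1, 0.2, 0.8, 0.9], [True, True, False, False]))  # 1.0
print(auc_from_distances([0.1, 0.9, 0.2, 0.8], [True, True, False, False]))  # 0.5
```

An AUC close to 0.5, as in the random example input above, indicates that the distances do not separate same-group pairs from different-group pairs.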

For protein sequences with high identity:

python benchmark_protein.py -i example_input/protein/high-ident/high-ident_random.tsv -f tsv -r high-ident

Gene Tree Inference

To calculate performance in the gene tree category, just run:

python benchmark_genetree.py -i example_input/genetree/swisstree_random.tsv -f tsv

You will get the nRF distance for each SwissTree gene tree, followed by the mean and standard deviation across all trees:

ST001	nRF	0.97
ST002	nRF	1.0
ST003	nRF	1.0
ST004	nRF	1.0
ST005	nRF	0.96
ST007	nRF	1.0
ST008	nRF	1.0
ST009	nRF	1.0
ST010	nRF	1.0
ST011	nRF	0.99
ST012	nRF	1.0
MEAN nRF:   0.993
STDEV nRF:  0.0124
