This readme accompanies the paper "LOFTK: a framework for fully automated calculation of predicted Loss-of-Function variants." by Alasiri A. et al. bioRxiv 2021.
Predicted Loss-of-Function (LoF) variants in human genes are important due to their impact on clinical phenotypes and frequent occurrence in the genomes of healthy individuals. Current approaches predict high-confidence LoF variants without identifying the specific genes or the number of copies they affect. Here we present an open-source tool, the Loss-of-Function ToolKit (LoFTK), which allows efficient and automated prediction of LoF variants from both genotyped and sequenced genomes, identifying genes that are inactive in one or two copies, and providing summary statistics for downstream analyses.
LoFTK is a pipeline written in Bash and Perl that efficiently identifies loss-of-function (LoF) variants using VEP and LOFTEE. It annotates LoF variants, selects high-confidence (HC) variants, reports which LoF variants are homozygous and which are heterozygous, and calculates summary statistics.
The Loss-of-Function ToolKit Workflow: finding knockouts using genotyped and sequenced genomes.
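To make the "one or two copies" distinction concrete, the following is a minimal sketch, not LoFTK's own code. It assumes a diploid genotype written in VCF style (for example 0/1 or 1|1) and counts the alternate alleles for a single variant. The function name lof_copies is hypothetical, and how LoFTK aggregates variants per gene is not covered here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of LoFTK): count alternate-allele copies
# in a VCF-style genotype string such as 0/1 or 1|1.
lof_copies() {
    # Split the genotype on "/" or "|" so each allele is on its own line,
    # then count alleles that are neither reference (0) nor missing (.).
    printf '%s\n' "$1" \
        | tr '/|' '\n\n' \
        | awk 'NF && $1 != "0" && $1 != "." { n++ } END { print n + 0 }'
}
```

Under these assumptions, lof_copies 0/1 reports one affected copy (heterozygous) and lof_copies "1|1" reports two (homozygous).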
LoFTK has been developed to run under two cluster managers: Simple Linux Utility for Resource Management (SLURM) and Sun Grid Engine (SGE). A separate LoFTK version is available for installation on each cluster manager (SLURM/SGE). See Installation and Requirements in the wiki.
All scripts are annotated for debugging purposes and future reference. The scripts are expected to work within a specific Linux environment; we have tested LoFTK on CentOS7 with SLURM as the cluster manager.
- Perl >= 5.10.1
- Bash
- Ensembl Variant Effect Predictor (VEP)
- LOFTEE for GRCh37
  - Ancestral sequence (human_ancestor.fa[.gz|.rz])
  - PhyloCSF database (phylocsf.sql) for conservation filters
- LOFTEE for GRCh38
  - GERP scores bigwig (gerp_bigwig)
  - Ancestral sequence (human_ancestor_fa)
  - PhyloCSF database (loftee.sql.gz)
- samtools (must be on path)
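Before a first run it can help to confirm that the command-line requirements above are reachable. The sketch below is an illustration rather than part of LoFTK; the function names missing_tools and perl_is_recent_enough are hypothetical. It checks only executables on the PATH and the Perl version, not the VEP or LOFTEE data files.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of LoFTK): print every given command that
# is not available on PATH, one per line. Returns 1 if any are missing.
missing_tools() {
    local status=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            status=1
        fi
    done
    return "$status"
}

# Hypothetical helper: succeed only if the perl on PATH is >= 5.10.1.
# Perl's $] holds the version as a number, e.g. 5.010001 for 5.10.1.
perl_is_recent_enough() {
    perl -e 'exit($] >= 5.010001 ? 0 : 1)'
}
```

A typical call would be missing_tools perl bash samtools, followed by perl_is_recent_enough once perl is known to be present.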
The only script the user should run is run_loftk.sh, in conjunction with the configuration file LoF.config. The configuration file must be set up before running any analysis; follow the instructions in the wiki.
You can run LoFTK using the following command:
bash run_loftk.sh $(pwd)/LoF.config
Always Remember
- To set all options in the LoF.config file before the run.
- To use the full path to the configuration file, e.g. use $(pwd).
- You can run LoFTK steps all in one run or separately by setting the analysis type in the LoF.config file.
- VEP and LOFTEE options can be added and modified in one of the configuration files in ./bin/.
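The full-path reminder above can be enforced with a small wrapper. This is a sketch under stated assumptions, not part of LoFTK: abs_path and run_with_config are hypothetical names, and the wrapper assumes it is called from the directory that contains run_loftk.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of LoFTK): print the absolute path of an
# existing file, or fail with a message on stderr.
abs_path() {
    if [ ! -f "$1" ]; then
        echo "Configuration file not found: $1" >&2
        return 1
    fi
    # Change into the file's directory in a subshell and let pwd report
    # that directory in absolute form.
    ( cd "$(dirname "$1")" && printf '%s/%s\n' "$(pwd)" "$(basename "$1")" )
}

# Hypothetical wrapper: resolve the configuration path, then start the
# pipeline with the same command shown above.
run_with_config() {
    local config
    config="$(abs_path "$1")" || return 1
    bash run_loftk.sh "$config"
}
```

With this in place, run_with_config LoF.config passes the absolute path even when a relative one is given, and stops early if the file does not exist.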
File | Description | Usage |
---|---|---|
README.md | Description of project | Human editable |
LICENSE | User permissions | Read only |
LoF.config | Configuration file | Human editable |
run_loftk.sh | Main LoFTK script | Read only |
LoF_annotation.sh | Annotation of LoF variants/genes | Read only |
allele_to_vcf.sh | Converting IMPUTE2 format to VCF | Read only |
descriptive_stat.sh | Descriptive analysis | Read only |
This script merges the counts files of different cohorts. By default it includes only genes that are present in all input files, but you can use the union function to include genes that are present in at least one cohort. For the cohorts that lack such a gene, its LoF counts are then set to 0 for every individual (which is tricky if the gene was not tested) or to a user-specified value.
perl merge_gene_lof_counts.pl -i cohortX.counts,cohortY.counts,cohortZ.counts -o merged_cohorts.counts -c
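The difference between the default behaviour and the union behaviour can be previewed on the gene lists alone. The sketch below is not the implementation of merge_gene_lof_counts.pl. It assumes, purely for illustration, that each counts file is whitespace-delimited with the gene identifier in the first column and no header line; the function names genes_in_all and genes_in_any are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print gene IDs (assumed to be column 1) that occur
# in every given file. This mirrors the default, intersection-style merge.
genes_in_all() {
    awk -v n="$#" '
        # Count each gene at most once per input file.
        !seen[FILENAME, $1]++ { count[$1]++ }
        END { for (g in count) if (count[g] == n) print g }
    ' "$@" | sort
}

# Hypothetical helper: print gene IDs that occur in at least one file.
# This mirrors the union-style merge.
genes_in_any() {
    awk '{ print $1 }' "$@" | sort -u
}
```

Genes reported by genes_in_any but not by genes_in_all are the ones whose counts would have to be filled in for some cohort, so they are worth reviewing before choosing a fill value.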