SNPSnip is a command-line tool with an interactive web interface for filtering VCF files in multiple stages.
NB: this is beta software still in active development, if it breaks or has inaccuracies please let me know
- Python 3.8 or higher
- bcftools 1.18 or higher must be installed and available in your PATH (check with
bcftools --version
, a statically compiled version is available here)
pip install snpsnip
Or from the latest source:
pip install git+https://github.com/gekkonid/snpsnip.git
I recommend using pipx
to install tools like this, to isolate tool dependencies.
snpsnip --output-dir filtered_results --vcf input.vcf.gz \
--maf 0.05 --max-missing 0.1 --min-qual 30
This will conduct three phases of analysis, with a web UI to select threholds etc between these steps.
- Initial Processing: SNPSnip extracts a random subset of SNPs passing
basic filters, and calculates per-sample stats and a sample PCA. You will be
then presented with a web UI where you can set your sample filtering
thresholds to exclude poor samples and (optionally) create subsets of
samples. You can also specify absolute minimum thresholds on MAF,
missingness, and variant quality to subset the variants that are considered,
which is useful if you have a very large number of singleton or poor quality
variants (use
--maf
,--max-missing
,--min-qual
for this). - Variant Filtering: For each sample group, SNPSnip will then calculate variant-level statistics from the random subset of SNPs (from step 1). You can then set your thresholds per-group to exclude poor SNPs.
- Final Filtering: The tool applies your sample and VCF filters to the full VCF file to generate filtered outputs for each group of samples.
Optionally, some predefined groups (e.g. populations, species, etc) can be
provided with the --groups-file
, --group-column
and --sample-column
arguments. These predefined groups can then be refined based on the sample PCA
or simply used verbatim to define subsets of samples.
You can also run this in an "offline" mode, useful for example on clusters
# First, make a subset and calcuate PCA & Sample stats
snpsnip --vcf input.vcf.gz --offline
# This generates a static HTML file you can download & play with to set your
# thresholds. Then, you save a .json file to your PC and then copy that file
# back to wherever you're running SNPsnip, then:
snpsnip --vcf input.vcf.gz --offline --next snpsnip_sample_filters.json
# This makes the subsets, and calculates the SNP stats for each group of
# samples you selected. This again generates a static HTML file you can use to
# interactively make your SNP filtering threshold selections. Again save the
# output, copy it back to where you're running SNPsnip, then:
snpsnip --vcf input.vcf.gz --offline --next snpsnip_variant_filters.json
# This will generate the final files.
This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at http://mozilla.org/MPL/2.0/.