+
+

NAME

+

galah cluster - Cluster FASTA files by average nucleotide identity

+
+
+

SYNOPSIS

+

galah cluster [GENOME_INPUTS] [OUTPUT_ARGUMENTS]

+
+
+

DESCRIPTION

+

This cluster mode dereplicates genomes, choosing a subset of the input genomes as representatives. Required inputs are (1) a genome definition, and (2) an output format definition.
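As an illustration of pairing a genome definition with an output definition, the following Python sketch assembles a `galah cluster` argument list using only options documented on this page. The helper name `build_galah_cluster_command` and the example paths are hypothetical. Actually running the resulting command (for example with `subprocess.run`) assumes galah is installed and on the PATH.

```python
def build_galah_cluster_command(genome_dir, extension, cluster_file, ani=99, threads=1):
    """Assemble an argument list for 'galah cluster' (hypothetical helper).

    Genome definition: a directory of FASTA files plus their extension.
    Output definition: a cluster definition file.
    """
    return [
        "galah",
        "cluster",
        f"--genome-fasta-directory={genome_dir}",
        f"--genome-fasta-extension={extension}",
        f"--ani={ani}",
        f"--threads={threads}",
        f"--output-cluster-definition={cluster_file}",
    ]


if __name__ == "__main__":
    # Example paths are placeholders, not files shipped with galah.
    command = build_galah_cluster_command("genomes", "fna", "clusters.tsv", threads=4)
    print(" ".join(command))
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.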

+
+
+

GENOME INPUT

+
+
-f, --genome-fasta-files=PATH ..
+

Path(s) to the FASTA file of each genome, e.g. 'pathA/genome1.fna pathB/genome2.fa'

+
+
+ +
+
-d, --genome-fasta-directory=PATH
+

Directory containing FASTA files of each genome

+
+
+ +
+
-x, --genome-fasta-extension=EXT
+

File extension of genomes in the directory specified with -d/--genome-fasta-directory \[default: fna\]

+
+
+ +
+
--genome-fasta-list=PATH
+

File containing FASTA file paths, one per line

+
+
+
+
+

FILTERING PARAMETERS

+
+
--checkm-tab-table=PATH
+

CheckM tab table for defining genome quality, which is used both for filtering and to rank genomes during clustering.

+
+
+ +
+
--genome-info=PATH
+

dRep style genome info table for defining quality. Used like --checkm-tab-table.

+
+
+ +
+
--min-completeness=FLOAT
+

Ignore genomes with less completeness than this percentage. \[default: not set\]

+
+
+ +
+
--max-contamination=FLOAT
+

Ignore genomes with more contamination than this percentage. \[default: not set\]
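The two thresholds above can be read as a simple keep-or-ignore test per genome. The sketch below is an interpretation of the option descriptions, not galah's actual implementation: the function name is hypothetical, and it assumes a genome exactly at a threshold is kept, since the descriptions say only "less" and "more".

```python
def passes_quality_filter(completeness, contamination,
                          min_completeness=None, max_contamination=None):
    """Return True if a genome would be kept under the given thresholds.

    Thresholds left as None are treated as "not set", matching the
    documented defaults. All values are percentages.
    """
    if min_completeness is not None and completeness < min_completeness:
        return False  # less complete than --min-completeness
    if max_contamination is not None and contamination > max_contamination:
        return False  # more contaminated than --max-contamination
    return True


if __name__ == "__main__":
    # Hypothetical genomes: (name, completeness %, contamination %)
    genomes = [("g1", 98.5, 1.2), ("g2", 72.0, 0.5), ("g3", 95.0, 12.0)]
    kept = [name for name, comp, cont in genomes
            if passes_quality_filter(comp, cont, min_completeness=90, max_contamination=5)]
    print(kept)
```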

+
+
+
+
+

CLUSTERING PARAMETERS

+
+
--ani=FLOAT
+

Overall ANI level to dereplicate at with FastANI. \[default: 99\]

+
+
+ +
+
--min-aligned-fraction=FLOAT
+

Minimum aligned fraction of two genomes for clustering. \[default: 50\]

+
+
+ +
+
--fragment-length=FLOAT
+

Length of fragment used in FastANI calculation (i.e. --fragLen). \[default: 3000\]

+
+
+ +
+
--quality-formula=NAME
+

Scoring function for genome quality. 'Parks2020_reduced' for 'completeness-5*contamination-5*num_contigs/100-5*num_ambiguous_bases/100000', which is reduced from a quality formula described in Parks et al. 2020 (https://doi.org/10.1038/s41587-020-0501-8); 'completeness-4contamination' for 'completeness-4*contamination'; 'completeness-5contamination' for 'completeness-5*contamination'; 'dRep' for 'completeness-5*contamination+contamination*(strain_heterogeneity/100)+0.5*log10(N50)'. \[default: Parks2020\_reduced\]
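To make the scoring formulas concrete, the following Python sketch transcribes them directly from the option description. The function names and the example input values are hypothetical; how galah reads these quantities from its input tables is not shown here.

```python
import math


def parks2020_reduced(completeness, contamination, num_contigs, num_ambiguous_bases):
    # completeness-5*contamination-5*num_contigs/100-5*num_ambiguous_bases/100000
    return (completeness
            - 5 * contamination
            - 5 * num_contigs / 100
            - 5 * num_ambiguous_bases / 100000)


def completeness_4contamination(completeness, contamination):
    # completeness-4*contamination
    return completeness - 4 * contamination


def completeness_5contamination(completeness, contamination):
    # completeness-5*contamination
    return completeness - 5 * contamination


def drep(completeness, contamination, strain_heterogeneity, n50):
    # completeness-5*contamination+contamination*(strain_heterogeneity/100)+0.5*log10(N50)
    return (completeness
            - 5 * contamination
            + contamination * (strain_heterogeneity / 100)
            + 0.5 * math.log10(n50))


if __name__ == "__main__":
    # Hypothetical genome: 95% complete, 2% contaminated, 100 contigs, no ambiguous bases.
    print(parks2020_reduced(95, 2, 100, 0))
    print(completeness_4contamination(95, 2))
    print(completeness_5contamination(95, 2))
    print(drep(95, 2, 50, 10000))
```

Under every formula, contamination is penalised several times more heavily than missing completeness, so a slightly less complete but cleaner genome can outrank a more complete, more contaminated one.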

+
+
+ +
+
--precluster-ani=FLOAT
+

Require at least this dashing-derived ANI for preclustering and to avoid FastANI on distant lineages within preclusters. \[default: 95\]

+
+
+ +
+
--precluster-method=NAME
+

Method of calculating rough ANI for dereplication. 'dashing' for HyperLogLog, 'finch' for finch MinHash. \[default: dashing\]

+
+
+
+
+

OUTPUT

+
+
--output-cluster-definition=PATH
+

Output a file of representative<TAB>member lines.
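A file in this representative<TAB>member layout can be loaded into a mapping from each representative to its members. The sketch below assumes one tab-separated pair per line, as described above. The function name and the sample lines are hypothetical, and whether a representative is also listed as its own member is assumed here purely for illustration.

```python
from collections import defaultdict


def parse_cluster_definition(lines):
    """Group members by representative from representative<TAB>member lines."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        representative, member = line.split("\t", 1)
        clusters[representative].append(member)
    return dict(clusters)


if __name__ == "__main__":
    # Hypothetical file contents; real paths depend on the galah inputs.
    sample = [
        "pathA/genome1.fna\tpathA/genome1.fna\n",
        "pathA/genome1.fna\tpathB/genome2.fa\n",
        "pathC/genome3.fna\tpathC/genome3.fna\n",
    ]
    for representative, members in parse_cluster_definition(sample).items():
        print(representative, len(members))
```

With a real output file, the same function can be called on an open file handle, for example `parse_cluster_definition(open(path))`, since file objects iterate line by line.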

+
+
+ +
+
--output-representative-fasta-directory=PATH
+

Symlink representative genomes into this directory.

+
+
+ +
+
--output-representative-fasta-directory-copy=PATH
+

Copy representative genomes into this directory.

+
+
+ +
+
--output-representative-list=PATH
+

Write a newline-separated list of paths to representative genomes to this file.

+
+
+
+
+

GENERAL PARAMETERS

+
+
-t, --threads=INT
+

Number of threads. \[default: 1\]

+
+
+ +
+
-v, --verbose
+

Print extra debugging information

+
+
+ +
+
-q, --quiet
+

Unless there is an error, do not print log messages

+
+
+ +
+
-h, --help
+

Output a short usage message.

+
+
+ +
+
--full-help
+

Output a full help message and display in 'man'.

+
+
+ +
+
--full-help-roff
+

Output a full help message in raw ROFF format for conversion to other formats.

+
+
+
+
+

EXIT STATUS

+
+
0
+

Successful program execution.

+
+
+ +
+
1
+

Unsuccessful program execution.

+
+
+ +
+
101
+

The program panicked.

+
+
+
+
+

AUTHOR

+
+
Ben J Woodcroft <benjwoodcroft near gmail.com>
+
+

Source code available at https://github.com/wwood/galah

+
+