A tool for motif annotation and visualization in tandem repeats.
MotifScope is also available online at https://motifscope.holstegelab.eu.
-
To install with conda:
cd install/conda
sh INSTALL.sh
Conda will install an environment called motifscope, in which the necessary dependencies are installed.
The conda environment is activated by executing
conda activate motifscope
in the shell. See the usage section on how to run MotifScope once the conda environment is activated.
-
To install with docker:
cd install/docker
sh build.sh
Docker will create an image called motifscope, in which the necessary dependencies are installed.
An example command for running motifscope within this docker image is available in run_docker.sh.
Please adapt the options in run_docker.sh to your specific use case. To run it (e.g. with the example files in the Motifscope/example folder):
sh run_docker.sh path/to/example_sequence.fa path/to/example_population.txt output_prefix
-
For running MotifScope on a set of sequences (reads or assemblies):
motifscope [-i input.fa] [-mink 2] [-maxk 10] [-o output.prefix]
-
To annotate sequences with class labels, use the -p option to provide an annotation file:
motifscope [-i input.fa] [-mink 2] [-maxk 10] [-p classes.txt] [-o output.prefix]
The class information will be shown as a separate color-coded column in the figure.
- The header of each sequence in input.fa should start with >sample#hap_number#. For example, for HG002 it could start with >HG002#1#.
- The class annotation file classes.txt should be a tab-separated file, with the first column holding the sample IDs and the second column the sample class, e.g. HG002 EUR (separated by a tab).
- The class annotation file can contain a header, which should read sample <class_name> (separated by a tab). The label of the second column, <class_name>, can be adapted and will be shown in the figure. When there is no header, the default label is 'population'.
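The header and class-file conventions described above can be checked or generated with a few lines of Python before running MotifScope. This is a minimal sketch and not part of MotifScope: the helper names sample_from_header and write_classes are hypothetical, and it assumes the sample ID is the text between ">" and the first "#".

```python
import csv
import io


def sample_from_header(header):
    """Return the sample ID from a FASTA header of the form >sample#hap_number#..."""
    if not header.startswith(">"):
        raise ValueError(f"not a FASTA header: {header!r}")
    parts = header[1:].split("#")
    # ">HG002#1#" splits into ["HG002", "1", ""], so at least three parts are expected.
    if len(parts) < 3 or not parts[0] or not parts[1]:
        raise ValueError(f"header does not start with >sample#hap_number#: {header!r}")
    return parts[0]


def write_classes(rows, handle, class_name="population"):
    """Write a tab-separated class file with a 'sample <class_name>' header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample", class_name])
    for sample, sample_class in rows:
        writer.writerow([sample, sample_class])


# Example: one sample taken from a header, written to an in-memory class file.
sample = sample_from_header(">HG002#1#")
buffer = io.StringIO()
write_classes([(sample, "EUR")], buffer)
print(buffer.getvalue(), end="")
```

To write an actual file, pass an open file handle (for example open("classes.txt", "w", newline="")) instead of the in-memory buffer.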
-
To disable sequence clustering and the dendrogram (e.g. in case of a single sequence), use the -c option:
motifscope -c False [-i input.fa] [-mink 2] [-maxk 10] [-o output.prefix]
-
To run multiple sequence alignment on the compressed representation of the sequence, set -msa to POAMotif (aligns complete motifs) or POANucleotide (aligns nucleotides).
-
To guide the algorithm with a set of known motifs, provide the motifs with -motifs motifs.txt. The motif file motifs.txt should contain the motifs separated by tabs.
-
To use random categorical colors for motifs, set -e to random. To project motifs onto a color scale, set -e to UMAP or MDS for dimension reduction based on motif similarities.
-
To characterize motif composition without generating a figure, set -figure to False.
-
To use the reverse complement of the input fasta, set -reverse to True.
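When MotifScope is called from a script, the options described above can be assembled programmatically. The following is a minimal sketch, not an official wrapper: the function build_command and its parameter names are hypothetical, the flags and allowed values are taken from the text above, and the values 2 and 10 for -mink and -maxk simply mirror the example commands.

```python
import shlex

MSA_CHOICES = {"POAMotif", "POANucleotide"}
COLOR_CHOICES = {"random", "UMAP", "MDS"}


def build_command(input_fasta, output_prefix, mink=2, maxk=10, classes=None,
                  cluster=True, msa=None, motifs=None, colors=None,
                  figure=True, reverse=False):
    """Assemble a motifscope argument list from the options described above."""
    if msa is not None and msa not in MSA_CHOICES:
        raise ValueError(f"-msa must be one of {sorted(MSA_CHOICES)}")
    if colors is not None and colors not in COLOR_CHOICES:
        raise ValueError(f"-e must be one of {sorted(COLOR_CHOICES)}")
    cmd = ["motifscope", "-i", input_fasta, "-mink", str(mink), "-maxk", str(maxk)]
    if classes is not None:
        cmd += ["-p", classes]          # class annotation file
    if not cluster:
        cmd += ["-c", "False"]          # disable clustering and the dendrogram
    if msa is not None:
        cmd += ["-msa", msa]
    if motifs is not None:
        cmd += ["-motifs", motifs]      # file with known motifs
    if colors is not None:
        cmd += ["-e", colors]
    if not figure:
        cmd += ["-figure", "False"]
    if reverse:
        cmd += ["-reverse", "True"]
    cmd += ["-o", output_prefix]
    return cmd


cmd = build_command("input.fa", "output.prefix", classes="classes.txt", msa="POAMotif")
print(shlex.join(cmd))
# To execute it (requires motifscope on PATH, e.g. inside the conda environment):
# import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems with paths that contain spaces.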
- The repeat compositions are output in a fasta file. For example,
>HG002#2#JAHKSD010000034.1:9910981-9913041/rc
G1 A1 G1 C1 A2 G1 A1 C1 T1 C1 T1 G1 T3 C1 A2 AAAAG12 A1 AAAAG1 C1 A1 T1 G1 T2 C1 T1 A3 G1 A1 G1
The motifs are separated by spaces. Each string represents a motif, and the following number indicates how many consecutive copies of that motif occur.
- The motif summary per sequence is output in a tab-separated file. The first column is the sequence header, the second column is the motif, the third column is the amount of sequence covered by the motif, and the fourth column is the count of the motif.
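Both output formats are straightforward to post-process. The sketch below is an illustration under stated assumptions, not code from MotifScope: the function names are hypothetical, each composition token is assumed to be letters followed by a decimal copy number, the summary file is assumed to have exactly four tab-separated columns, and the summary row used in the example is made up for demonstration.

```python
import csv
import io
import re

# A composition token is a motif followed by its copy number, e.g. "AAAAG12".
_TOKEN = re.compile(r"([A-Za-z]+)(\d+)")


def parse_composition(line):
    """Turn 'A2 AAAAG12 ...' into [('A', 2), ('AAAAG', 12), ...]."""
    pairs = []
    for token in line.split():
        match = _TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"unexpected token: {token!r}")
        pairs.append((match.group(1), int(match.group(2))))
    return pairs


def expand(pairs):
    """Rebuild the nucleotide sequence from (motif, copies) pairs."""
    return "".join(motif * copies for motif, copies in pairs)


def read_motif_summary(handle):
    """Yield (header, motif, covered, count) from the tab-separated summary."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        header, motif, covered, count = row
        try:
            yield header, motif, float(covered), int(count)
        except ValueError:
            continue  # skip rows with non-numeric values, e.g. a title row


# Fragment of the composition line shown above.
pairs = parse_composition("A2 AAAAG12 A1 AAAAG1 C1")
print(sum(copies for motif, copies in pairs if motif == "AAAAG"))
print(len(expand(pairs)))

# Hypothetical summary row, for illustration only.
summary = io.StringIO("HG002#1#example\tAAAAG\t65\t13\n")
for record in read_motif_summary(summary):
    print(record)
```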