dbAPIS website: https://bcb.unl.edu/dbAPIS
Yan, Y., Zheng, J., Zhang, X., & Yin, Y. (2023). dbAPIS: a database of anti-prokaryotic immune system genes. Nucleic Acids Research. https://doi.org/10.1093/nar/gkad932
- BLAST+: compare sequences against a database.
- MMseqs2: sequence search and clustering.
- MAFFT: multiple sequence alignment.
- HMMER: sequence analysis using profile HMMs.
- hh-suite: remote protein homology detection suite.
- Foldseek: protein structure comparison.
- DIAMOND: sequence aligner for protein and translated DNA searches.
- Pfam: protein domain family database.
- PHROG: Prokaryotic Virus Remote Homologous Groups database.
- ColabFold: AlphaFold2 structure prediction.
- clinker: gene cluster comparison figure generator.
- BLASTP homology search: blast_seed.sh
- MMseqs2 clustering/searching: create_family_and_update.sh
- Pfam and PHROG annotation: phrog_pfam_annotation.sh
Download the APIS protein family profile HMMs
wget https://bcb.unl.edu/dbAPIS/downloads/dbAPIS.hmm
Prepare a profile database by constructing binary compressed data files
hmmpress dbAPIS.hmm
Four files are created: dbAPIS.hmm.h3m, dbAPIS.hmm.h3i, dbAPIS.hmm.h3f, and dbAPIS.hmm.h3p.
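Before running hmmscan, it can help to confirm that all four index files are present. The following is a minimal sketch; the function name check_pressed is hypothetical and not part of dbAPIS or HMMER.

```shell
# check_pressed HMMFILE
# Return 0 if all four hmmpress index files exist and are non-empty
# next to HMMFILE; otherwise report the first missing file and return 1.
check_pressed() {
    for ext in h3m h3i h3f h3p; do
        if [ ! -s "$1.$ext" ]; then
            echo "missing: $1.$ext" >&2
            return 1
        fi
    done
    return 0
}
```

Usage: check_pressed dbAPIS.hmm && echo "profile database is ready"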
Run hmmscan on your amino acid sequences
hmmscan --domtblout hmmscan.out --noali dbAPIS.hmm your_sequence.faa
The --domtblout option produces a space-delimited table of domain hits, with one line per domain. The --noali option omits the alignment section from the output to reduce its volume. For more information on hmmscan, see the HMMER user guide.
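To inspect hmmscan.out directly, the domain table can be filtered with awk. The sketch below assumes the standard HMMER domtblout layout: comment lines start with #, column 1 is the target name, column 4 the query name, column 12 the conditional E-value, and column 14 the domain score. The function name filter_domtbl and the default cutoff of 1e-5 are illustrative only and are not dbAPIS recommendations.

```shell
# filter_domtbl FILE [CUTOFF]
# Print target, query, conditional E-value, and domain score for every
# domain hit whose conditional E-value (column 12) is <= CUTOFF.
filter_domtbl() {
    awk -v cutoff="${2:-1e-5}" '
        /^#/ { next }                      # skip header and footer comment lines
        ($12 + 0) <= (cutoff + 0) {        # "+ 0" forces a numeric comparison
            print $1 "\t" $4 "\t" $12 "\t" $14
        }
    ' "$1"
}
```

Usage: filter_domtbl hmmscan.out 1e-10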
Download the APIS protein sequences
wget https://bcb.unl.edu/dbAPIS/downloads/anti_defense.pep
Build a DIAMOND database from the APIS protein sequences
diamond makedb --in anti_defense.pep -d APIS_db
Run DIAMOND on your amino acid sequences
diamond blastp --db APIS_db -q your_sequence.faa -f 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen -o diamond.out --max-target-seqs 10000
The -f 6 option produces tab-separated output (the BLAST tabular format, equivalent to the BLAST option -outfmt 6), composed here of the customized fields listed after it. The --max-target-seqs option sets the maximum number of target sequences to report alignments for. For more details on DIAMOND, see the DIAMOND tutorial.
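The qlen and slen fields requested above make it possible to derive coverage values from the raw diamond.out file. The following sketch assumes the column order of the command above (qstart=7, qend=8, sstart=9, send=10, qlen=13, slen=14) and defines coverage as the aligned span divided by the sequence length. This definition and the function name add_coverage are assumptions; parse_annotation_result.sh may compute coverage differently.

```shell
# add_coverage FILE
# Append query coverage and subject coverage (both in percent) as two
# extra tab-separated columns to each DIAMOND hit line.
add_coverage() {
    awk -F '\t' '
    {
        qcov = ($8 - $7 + 1) / $13 * 100   # aligned query span / query length
        scov = ($10 - $9 + 1) / $14 * 100  # aligned subject span / subject length
        printf "%s\t%.1f\t%.1f\n", $0, qcov, scov
    }' "$1"
}
```

Usage: add_coverage diamond.out > diamond.with_cov.tsv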
Download the family member mapping table and parser script
wget https://bcb.unl.edu/dbAPIS/downloads/seed_family_mapping.tsv
wget https://bcb.unl.edu/dbAPIS/downloads/parse_annotation_result.sh
Run script to parse annotation output files
bash parse_annotation_result.sh hmmscan.out diamond.out
This generates one parsed output file each for hmmscan and DIAMOND.
hmmscan.out.parsed.tsv contains 13 columns:
- Query: query sequence ID
- Query len: query sequence length
- Hit family: hit family ID
- Defense type: type of defense system inhibited by the hit family
- Hit CLAN: hit clan ID
- Hit CLAN defense type: inferred (predicted) type of defense system inhibited by the hit clan
- Family len: length of the target family profile
- Domain c-evalue: the “conditional E-value”, a permissive measure of how reliable this particular domain may be
- Domain score: the bit score for this domain
- Query from: query start position
- Query to: query end position
- HMM from: the start of the MEA alignment of this domain with respect to the profile
- HMM to: the end of the MEA alignment of this domain with respect to the profile
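With the column layout above, a common follow-up is to keep only the highest-scoring family per query. The sketch below assumes hmmscan.out.parsed.tsv is tab-separated with the columns in the order listed (Query in column 1, Domain score in column 9). Whether the file has a header row is not stated here, so lines whose score column is not numeric are skipped. The function name best_hit_per_query is hypothetical.

```shell
# best_hit_per_query FILE
# For each query (column 1), print the line with the highest
# domain score (column 9). Output is sorted for a stable order.
best_hit_per_query() {
    awk -F '\t' '
        $9 !~ /^-?[0-9]/ { next }          # skip a header row, if present
        !($1 in best) || ($9 + 0) > best[$1] {
            best[$1] = $9 + 0              # best score seen so far for this query
            line[$1] = $0                  # remember the full line
        }
        END { for (q in line) print line[q] }
    ' "$1" | sort
}
```

Usage: best_hit_per_query hmmscan.out.parsed.tsv > hmmscan.best.tsv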
diamond.out.parsed.tsv contains 12 columns:
- qseqid: query sequence ID
- famid: hit family ID
- Defense type: type of defense system inhibited by the hit family
- Hit CLAN: hit clan ID
- Hit CLAN defense type: inferred (predicted) type of defense system inhibited by the hit clan
- seqid: hit sequence ID
- pident: the percentage of identical amino acid residues that were aligned
- align length: the total length of the alignment, including matching, mismatching and gap positions of query and subject
- evalue: the expected value of the hit
- bitscore: a scoring-matrix-independent measure of the (local) similarity of the two aligned sequences; higher values mean greater similarity
- qcov: query coverage, the percentage of the query sequence that aligned
- scov: subject coverage, the percentage of the hit sequence that aligned
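The identity and coverage columns can be used to narrow the DIAMOND hits to closer homologs. The sketch below assumes diamond.out.parsed.tsv is tab-separated with the columns in the order listed (pident in column 7, qcov in column 11, scov in column 12, all as percentages). The function name filter_diamond_hits and the default thresholds of 30 and 70 are arbitrary illustrations, not cutoffs recommended by dbAPIS.

```shell
# filter_diamond_hits FILE [MIN_PIDENT] [MIN_COV]
# Print hits with pident >= MIN_PIDENT and with both qcov and scov >= MIN_COV.
filter_diamond_hits() {
    awk -F '\t' -v min_id="${2:-30}" -v min_cov="${3:-70}" '
        $7 !~ /^[0-9]/ { next }            # skip a header row, if present
        ($7 + 0) >= min_id && ($11 + 0) >= min_cov && ($12 + 0) >= min_cov
    ' "$1"
}
```

Usage: filter_diamond_hits diamond.out.parsed.tsv 40 80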