dbAPIS is a database of anti-prokaryotic immune system proteins. The repository contains the code and scripts used to generate and maintain the database.

azureycy/dbAPIS

Yan, Y., Zheng, J., Zhang, X., & Yin, Y. (2023). dbAPIS: a database of anti-prokaryotic immune system genes. Nucleic Acids Research. https://doi.org/10.1093/nar/gkad932

Tools and databases

  • BLAST+: compare sequences to a database.
  • MMseqs2: sequence search and clustering.
  • MAFFT: multiple sequence alignment.
  • HMMER: sequence analysis using profile HMMs.
  • hh-suite: remote protein homology detection suite.
  • Foldseek: protein structure comparison.
  • DIAMOND: sequence aligner for protein and translated DNA searches.
  • Pfam: protein domain family database.
  • PHROG: Prokaryotic Virus Remote Homologous Groups database.
  • ColabFold: AlphaFold2 structure prediction.
  • clinker: gene cluster comparison figure generator.

Database content processing

Create APIS protein families and add newly curated proteins

Build APIS protein family HMMs

Searching homologous families using HHsearch

Protein function annotation

Protein structure prediction

Searching protein structure homologs using Foldseek

Genomic context visualization using JBrowse

Gene cluster comparison using clinker

Run APIS protein annotation with DIAMOND and hmmscan locally

Run hmmscan on your local server

Download the APIS protein family profile HMMs

wget https://bcb.unl.edu/dbAPIS/downloads/dbAPIS.hmm

Prepare the profile database by building binary compressed data files

hmmpress dbAPIS.hmm

Four files are created: dbAPIS.hmm.h3m, dbAPIS.hmm.h3i, dbAPIS.hmm.h3f, and dbAPIS.hmm.h3p.
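Before running hmmscan, it can be worth confirming that all four index files exist and are non-empty. The helper below is a minimal sketch; the function name `check_pressed` is hypothetical and not part of HMMER or this repository.

```shell
# Hypothetical helper: verify that hmmpress produced all four index files.
# $1 is the path to the HMM file that was pressed (e.g. dbAPIS.hmm).
check_pressed() {
    for ext in h3m h3i h3f h3p; do
        # -s is true only if the file exists and has a size greater than zero
        [ -s "$1.$ext" ] || { echo "missing or empty: $1.$ext" >&2; return 1; }
    done
    return 0
}
```

For example, `check_pressed dbAPIS.hmm && echo "profile database ready"` prints the message only when all four files are present.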

Run hmmscan on your amino acid sequences

hmmscan --domtblout hmmscan.out --noali dbAPIS.hmm your_sequence.faa

The --domtblout option writes a space-delimited table of domain hits, with one line per domain. The --noali option omits the alignment section from the output to reduce its volume. For more information on hmmscan, see the HMMER user guide.
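The domain table can be inspected directly with awk before any further parsing. The sketch below assumes the standard HMMER domtblout layout (column 1 target name, column 4 query name, column 12 conditional E-value, column 13 independent E-value, column 14 domain score, columns 18 and 19 alignment start and end on the query). The function name `filter_domtblout` and the default cutoff of 1e-5 are illustrative assumptions, not the filtering that parse_annotation_result.sh applies.

```shell
# Hypothetical helper: keep domain hits with an independent E-value at or
# below a cutoff. The cutoff default (1e-5) is an arbitrary example value.
# $1: domtblout file; $2: optional E-value cutoff
filter_domtblout() {
    awk -v cutoff="${2:-1e-5}" '
        /^#/ { next }                    # skip header and footer comment lines
        ($13 + 0) <= (cutoff + 0) {      # column 13 = independent E-value
            # family, query, i-Evalue, domain score, query start, query end
            print $1 "\t" $4 "\t" $13 "\t" $14 "\t" $18 "\t" $19
        }
    ' "$1"
}
```

A call such as `filter_domtblout hmmscan.out 1e-10 > hmmscan.filtered.tsv` would write a tab-separated subset of the hits.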

Run DIAMOND on your local server

Download the APIS protein sequences

wget https://bcb.unl.edu/dbAPIS/downloads/anti_defense.pep

Build a DIAMOND database from the APIS protein sequences

diamond makedb --in anti_defense.pep -d APIS_db

Run DIAMOND on your amino acid sequences

diamond blastp --db APIS_db -q your_sequence.faa -f 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen -o diamond.out --max-target-seqs 10000

The -f 6 option produces tab-separated output (equivalent to the BLAST option -outfmt 6) containing the fields listed after it. --max-target-seqs sets the maximum number of target sequences for which alignments are reported. For more details, see the DIAMOND tutorial.
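Because the command requests qlen and slen, query and subject coverage can be derived from the raw table. The sketch below assumes the 14-field order given in the command above and computes coverage as aligned span divided by sequence length. Both the function name `add_coverage` and this formula are assumptions; the parser script may compute qcov and scov differently.

```shell
# Hypothetical helper: print qseqid, sseqid, pident, qcov and scov for each hit.
# $1: DIAMOND tabular output with the 14 fields requested above
add_coverage() {
    awk 'BEGIN { FS = OFS = "\t" }
    NF >= 14 && $13 > 0 && $14 > 0 {
        qcov = ($8 - $7 + 1) / $13 * 100     # qstart=$7, qend=$8, qlen=$13
        scov = ($10 - $9 + 1) / $14 * 100    # sstart=$9, send=$10, slen=$14
        printf "%s\t%s\t%s\t%.1f\t%.1f\n", $1, $2, $3, qcov, scov
    }' "$1"
}
```

For example, `add_coverage diamond.out | head` shows the coverage values for the first few hits.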

Parse annotation output

Download the family member mapping table and parser script

wget https://bcb.unl.edu/dbAPIS/downloads/seed_family_mapping.tsv
wget https://bcb.unl.edu/dbAPIS/downloads/parse_annotation_result.sh

Run script to parse annotation output files

bash parse_annotation_result.sh hmmscan.out diamond.out

This generates one parsed output file each for hmmscan and DIAMOND:

  • hmmscan.out.parsed.tsv contains 13 columns:
  1. Query: query sequence ID
  2. Query len: query sequence length
  3. Hit family: hit family ID
  4. Defense type: type of defense system inhibited by the hit family
  5. Hit CLAN: hit clan ID
  6. Hit CLAN defense type: inferred (predicted) type of defense system inhibited by the hit clan
  7. Family len: length of the target family profile
  8. Domain c-evalue: the “conditional E-value”, a permissive measure of how reliable this particular domain may be
  9. Domain score: the bit score for this domain
  10. Query from: query start position
  11. Query to: query end position
  12. HMM from: the start of the MEA alignment of this domain with respect to the profile
  13. HMM to: the end of the MEA alignment of this domain with respect to the profile
  • diamond.out.parsed.tsv contains 12 columns:
  1. qseqid: query sequence ID
  2. famid: hit family ID
  3. Defense type: type of defense system inhibited by the hit family
  4. Hit CLAN: hit clan ID
  5. Hit CLAN defense type: inferred (predicted) type of defense system inhibited by the hit clan
  6. seqid: hit sequence ID
  7. pident: the percentage of identical amino acid residues that were aligned
  8. align length: the total length of the alignment, including matching, mismatching and gap positions of query and subject
  9. evalue: the expected value of the hit
  10. bitscore: a scoring-matrix-independent measure of the (local) similarity of the two aligned sequences; higher values mean greater similarity
  11. qcov: query coverage, the percentage of the query sequence that aligned
  12. scov: subject coverage, the percentage of the hit sequence that aligned
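With --max-target-seqs 10000 a query can have many rows in diamond.out.parsed.tsv, so a common follow-up is to keep one row per query. The sketch below assumes the 12-column layout listed above (column 1 qseqid, column 10 bitscore) and skips any line whose tenth column is not numeric, in case the file starts with a header line. The function name `best_hit_per_query` is hypothetical and not part of the repository.

```shell
# Hypothetical helper: keep the highest-bitscore row for each query.
# $1: diamond.out.parsed.tsv (tab-separated, 12 columns)
best_hit_per_query() {
    awk 'BEGIN { FS = "\t" }
    $10 !~ /^[0-9.eE+-]+$/ { next }            # skip a header line, if present
    !($1 in best) || ($10 + 0) > best[$1] {    # first row, or a better bitscore
        best[$1] = $10 + 0
        line[$1] = $0
    }
    END { for (q in line) print line[q] }' "$1" | LC_ALL=C sort
}
```

For example, `best_hit_per_query diamond.out.parsed.tsv > diamond.best.tsv` writes one row per query, sorted by query ID. Ties on bitscore keep the row that appears first in the input.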
