Description
Hi @timkahlke ,
Thanks for the excellent tool!
I have a question regarding the correct usage of BASTA when annotating DIAMOND results against the UniProt TrEMBL database.
I would like to confirm whether I need to reformat my DIAMOND blastp result before using it with BASTA.
I used DIAMOND to perform a blastp search, and here is a sample of my output:
(diamond) [zhangmy@login 21.genecatalog_taxonomy]$ head genecatalog_uniprot_trembl
gene_2 tr|A0A7R7EMU5|A0A7R7EMU5_9FIRM 60.5 124 48 1 1 124 74 196 2.78e-38 142
gene_2 tr|U2DCY8|U2DCY8_9FIRM 46.8 126 64 2 1 124 74 198 1.42e-25 109
gene_2 tr|A0A849BE81|A0A849BE81_9GAMM 44.0 125 69 1 1 124 74 198 2.39e-25 108
gene_2 tr|A0A2L2XRM5|A0A2L2XRM5_9CHRO 45.6 125 67 1 1 124 88 212 4.07e-24 105
gene_2 tr|A0A3E0KUI1|A0A3E0KUI1_9CHRO 45.6 125 67 1 1 124 88 212 4.19e-24 105
gene_2 tr|A0A1B9Y2A3|A0A1B9Y2A3_9FLAO 44.0 125 69 1 1 124 74 198 8.53e-23 101
gene_2 tr|A0A2S6CWN4|A0A2S6CWN4_9CYAN 45.6 125 67 1 1 124 81 205 2.51e-22 100
gene_2 tr|A0A2T6BYK4|A0A2T6BYK4_9FLAO 44.0 125 69 1 1 124 74 198 2.59e-21 97.8
gene_2 tr|A0A2I2M9S7|A0A2I2M9S7_9FLAO 42.9 126 69 2 1 124 74 198 1.67e-20 95.5
gene_2 tr|A0A2A5APR6|A0A2A5APR6_UNCCC 42.9 126 69 2 1 124 76 200 3.68e-20 94.7
In my initial BASTA run, every sequence was assigned Unknown (although I stopped the run before it completed).
Here are the same hits with the second column reduced to the bare UniProt accession:
gene_2 A0A7R7EMU5 60.5 124 48 1 1 124 74 196 2.78e-38 142
gene_2 U2DCY8 46.8 126 64 2 1 124 74 198 1.42e-25 109
gene_2 A0A849BE81 44.0 125 69 1 1 124 74 198 2.39e-25 108
gene_2 A0A2L2XRM5 45.6 125 67 1 1 124 88 212 4.07e-24 105
gene_2 A0A3E0KUI1 45.6 125 67 1 1 124 88 212 4.19e-24 105
gene_2 A0A1B9Y2A3 44.0 125 69 1 1 124 74 198 8.53e-23 101
gene_2 A0A2S6CWN4 45.6 125 67 1 1 124 81 205 2.51e-22 100
gene_2 A0A2T6BYK4 44.0 125 69 1 1 124 74 198 2.59e-21 97.8
gene_2 A0A2I2M9S7 42.9 126 69 2 1 124 74 198 1.67e-20 95.5
gene_2 A0A2A5APR6 42.9 126 69 2 1 124 76 200 3.68e-20 94.7
My questions:
Do I need to reformat the second column (sseqid) of the DIAMOND output, from tr|A0A2L2XRM5|... to just the UniProt accession (e.g., A0A2L2XRM5)?
Is that the reason why I got all Unknown assignments?
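In case it helps to make question 1 concrete, here is a minimal sketch of the reformatting I have in mind. It assumes the DIAMOND output is tab-separated (DIAMOND's default tabular format) and that subject IDs follow the db|ACCESSION|ENTRY_NAME pattern shown above. The function name strip_uniprot_ids is my own and not part of BASTA or DIAMOND, and I have not confirmed that BASTA actually requires this step.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of BASTA or DIAMOND).
# Assumes a tab-separated tabular file whose 2nd column looks like
# db|ACCESSION|ENTRY_NAME, e.g. tr|A0A7R7EMU5|A0A7R7EMU5_9FIRM.
strip_uniprot_ids() {
    awk 'BEGIN { FS = OFS = "\t" }
         {
             # Split the subject ID on "|" and keep only the middle part.
             n = split($2, parts, "|")
             # IDs that do not have exactly three parts are left unchanged.
             if (n == 3) $2 = parts[2]
             print
         }' "$1"
}
```

It could be called as `strip_uniprot_ids genecatalog_uniprot_trembl > genecatalog_uniprot_trembl.acc`, where the output file name is only an example.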
My current BASTA script:
##=======Step 1=======##
diamond blastp \
-q /public/home/zhangmy/1.Tibetan_Macaque/04.CD-hit_output/gene_catalog.faa \
-d /public/home/zhangmy/database/uniprot_trembl/uniprot_trembl.dmnd \
-t /public/home/zhangmy/1.Tibetan_Macaque/21.genecatalog_taxonomy/tmp \
-p 60 \
-e 1e-5 \
-k 50 \
--id 30 \
--sensitive \
-o /public/home/zhangmy/1.Tibetan_Macaque/21.genecatalog_taxonomy/genecatalog_uniprot_trembl
##=======Step 2=======##
Basta sequence \
-l 25 -i 80 -e 0.00001 -m 3 -b 1 -p 60 \
/public/home/zhangmy/1.Tibetan_Macaque/21.genecatalog_taxonomy/genecatalog_uniprot_trembl \
/public/home/zhangmy/1.Tibetan_Macaque/21.genecatalog_taxonomy/genecatalog_uniprot_trembl.lca.out \
prot
Thank you very much for your help!
Best regards,