All Unknown results when using DIAMOND output with UniProt IDs like tr|...|... – should sseqid be cleaned? · Issue #50 · timkahlke/BASTA · GitHub
All Unknown results when using DIAMOND output with UniProt IDs like tr|...|... – should sseqid be cleaned? #50
Open
@ZhangMYioz

Hi @timkahlke ,

Thanks for the excellent tool!

I have a question regarding the correct usage of BASTA when annotating DIAMOND results against the UniProt TrEMBL database.

I would like to confirm whether I need to reformat my DIAMOND blastp result before using it with BASTA.
I used DIAMOND to perform a blastp search, and here is a sample of my output:

(diamond) [zhangmy@login 21.genecatalog_taxonomy]$ head genecatalog_uniprot_trembl
gene_2	tr|A0A7R7EMU5|A0A7R7EMU5_9FIRM	60.5	124	48	1	1	124	74	196	2.78e-38	142
gene_2	tr|U2DCY8|U2DCY8_9FIRM	46.8	126	64	2	1	124	74	198	1.42e-25	109
gene_2	tr|A0A849BE81|A0A849BE81_9GAMM	44.0	125	69	1	1	124	74	198	2.39e-25	108
gene_2	tr|A0A2L2XRM5|A0A2L2XRM5_9CHRO	45.6	125	67	1	1	124	88	212	4.07e-24	105
gene_2	tr|A0A3E0KUI1|A0A3E0KUI1_9CHRO	45.6	125	67	1	1	124	88	212	4.19e-24	105
gene_2	tr|A0A1B9Y2A3|A0A1B9Y2A3_9FLAO	44.0	125	69	1	1	124	74	198	8.53e-23	101
gene_2	tr|A0A2S6CWN4|A0A2S6CWN4_9CYAN	45.6	125	67	1	1	124	81	205	2.51e-22	100
gene_2	tr|A0A2T6BYK4|A0A2T6BYK4_9FLAO	44.0	125	69	1	1	124	74	198	2.59e-21	97.8
gene_2	tr|A0A2I2M9S7|A0A2I2M9S7_9FLAO	42.9	126	69	2	1	124	74	198	1.67e-20	95.5
gene_2	tr|A0A2A5APR6|A0A2A5APR6_UNCCC	42.9	126	69	2	1	124	76	200	3.68e-20	94.7

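In the initial run with BASTA, I received only Unknown results for all sequences (although I stopped the run before it completed). For comparison, this is the same output with the second column reduced to the bare UniProt accession: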

gene_2	A0A7R7EMU5	60.5	124	48	1	1	124	74	196	2.78e-38	142
gene_2	U2DCY8	46.8	126	64	2	1	124	74	198	1.42e-25	109
gene_2	A0A849BE81	44.0	125	69	1	1	124	74	198	2.39e-25	108
gene_2	A0A2L2XRM5	45.6	125	67	1	1	124	88	212	4.07e-24	105
gene_2	A0A3E0KUI1	45.6	125	67	1	1	124	88	212	4.19e-24	105
gene_2	A0A1B9Y2A3	44.0	125	69	1	1	124	74	198	8.53e-23	101
gene_2	A0A2S6CWN4	45.6	125	67	1	1	124	81	205	2.51e-22	100
gene_2	A0A2T6BYK4	44.0	125	69	1	1	124	74	198	2.59e-21	97.8
gene_2	A0A2I2M9S7	42.9	126	69	2	1	124	74	198	1.67e-20	95.5
gene_2	A0A2A5APR6	42.9	126	69	2	1	124	76	200	3.68e-20	94.7

My questions:
1. Do I need to reformat the second column (sseqid) of the DIAMOND output, from tr|A0A2L2XRM5|... to just the UniProt accession (e.g., A0A2L2XRM5)?
2. Is that the reason why I got all Unknown assignments?
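The reformatting asked about in the first question can be sketched in a few lines of Python. This is only an illustration: whether BASTA actually requires bare accessions is not confirmed anywhere in this issue, and the function names `uniprot_accession` and `clean_sseqid` are hypothetical helpers, not part of BASTA or DIAMOND. The sketch assumes the subject ID follows the UniProt `db|ACCESSION|ENTRY_NAME` pattern with `tr` or `sp` as the database prefix, and leaves any other ID untouched.

```python
def uniprot_accession(sseqid: str) -> str:
    """Return ACCESSION from 'tr|ACCESSION|ENTRY_NAME' (or 'sp|...').

    Any ID that does not match that three-part pattern is returned unchanged.
    """
    parts = sseqid.split("|")
    if len(parts) == 3 and parts[0] in ("tr", "sp"):
        return parts[1]
    return sseqid


def clean_sseqid(lines):
    """Yield tab-separated rows with column 2 (sseqid) reduced to the accession."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 1:
            fields[1] = uniprot_accession(fields[1])
        yield "\t".join(fields)


if __name__ == "__main__":
    # One row taken from the DIAMOND sample shown in this issue.
    sample = [
        "gene_2\ttr|A0A7R7EMU5|A0A7R7EMU5_9FIRM\t60.5\t124\t48\t1\t1\t124\t74\t196\t2.78e-38\t142\n"
    ]
    for row in clean_sseqid(sample):
        print(row)
```

To convert a whole file, the same generator can be fed an open file handle and its rows written to a new file, so the original DIAMOND output stays intact.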

My current BASTA script:

##=======Step 1=======##
diamond blastp \
       -q /public/home/zhangmy/1.Tibetan_Macaque/04.CD-hit_output/gene_catalog.faa \
       -d /public/home/zhangmy/database/uniprot_trembl/uniprot_trembl.dmnd \
       -t /public/home/zhangmy/1.Tibetan_Macaque/21.genecatalog_taxonomy/tmp \
       -p 60 \
       -e 1e-5 \
       -k 50 \
       --id 30 \
       --sensitive \
       -o /public/home/zhangmy/1.Tibetan_Macaque/21.genecatalog_taxonomy/genecatalog_uniprot_trembl

##=======Step 2=======##
Basta sequence \
       -l 25 -i 80 -e 0.00001 -m 3 -b 1 -p 60 \
       /public/home/zhangmy/1.Tibetan_Macaque/21.genecatalog_taxonomy/genecatalog_uniprot_trembl \
       /public/home/zhangmy/1.Tibetan_Macaque/21.genecatalog_taxonomy/genecatalog_uniprot_trembl.lca.out \
       prot
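Independently of the ID format, it may be worth checking how many hits survive the thresholds passed to BASTA. The sketch below is not BASTA's own filtering code. It assumes that `-l`, `-i`, `-e` and `-m` mean minimum alignment length, minimum percent identity, maximum e-value and minimum number of hits respectively (an assumption to verify against BASTA's help output), and that the file uses DIAMOND's default 12-column tabular layout. The function names are hypothetical.

```python
# Thresholds copied from the BASTA call above; their meaning is an assumption.
MIN_ALEN = 25        # -l
MIN_IDENTITY = 80.0  # -i
MAX_EVALUE = 1e-5    # -e
MIN_HITS = 3         # -m


def passes(fields):
    """Check one tabular row (qseqid, sseqid, pident, length, ..., evalue, bitscore)."""
    pident = float(fields[2])
    alen = int(fields[3])
    evalue = float(fields[10])
    return pident >= MIN_IDENTITY and alen >= MIN_ALEN and evalue <= MAX_EVALUE


def surviving_hits(lines):
    """Return {query: number of hits that pass all three thresholds}."""
    counts = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or empty lines
        counts.setdefault(fields[0], 0)
        if passes(fields):
            counts[fields[0]] += 1
    return counts


if __name__ == "__main__":
    # Three rows taken from the DIAMOND sample shown in this issue.
    sample = [
        "gene_2\ttr|A0A7R7EMU5|A0A7R7EMU5_9FIRM\t60.5\t124\t48\t1\t1\t124\t74\t196\t2.78e-38\t142",
        "gene_2\ttr|U2DCY8|U2DCY8_9FIRM\t46.8\t126\t64\t2\t1\t124\t74\t198\t1.42e-25\t109",
        "gene_2\ttr|A0A849BE81|A0A849BE81_9GAMM\t44.0\t125\t69\t1\t1\t124\t74\t198\t2.39e-25\t108",
    ]
    for query, n in surviving_hits(sample).items():
        status = "enough hits" if n >= MIN_HITS else "too few hits"
        print(f"{query}: {n} passing ({status})")
```

In the sample shown in this issue, the highest identity for gene_2 is 60.5, so under the stated assumption about `-i 80` no hit would pass the identity filter regardless of how the subject IDs are formatted.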

Thank you very much for your help!

Best regards,
