Target annotation using GTF/GFF3 #311

Closed
micknudsen opened this issue Jan 30, 2018 · 5 comments

Comments

@micknudsen
Contributor

So far I have been using a refFlat file as the --annotate input to the target command. However, I am now planning to start using GENCODE, and the only available files are GTF and GFF3.

Example: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_27/gencode.v27.annotation.gtf.gz

According to the target documentation, one should be able to use GTF as input. However, it doesn't seem to work. For example, here is a simple bed file containing only one BRCA1 exon (with exact coordinates as given in the GTF file):

$ cat test.bed
chr17	43124026	43124115

But target does not annotate the interval:

$ cnvkit.py target --annotate gencode.v27.annotation.gtf test.bed
Detected file format: bed
Applying annotations as target names
Detected file format: gff
chr17	43124026	43124115	-

Note that it auto-detects GFF format even though the input is GTF. Output is the same when the GFF3 is used as input.

@etal
Owner
etal commented Jan 30, 2018

Thanks for the feedback. GFF/GTF/GFF3 support is pretty simplistic, all handled by the same parser, and relatively untested. On my GFF3 test files the parser picks up gene names correctly, but I'll need to modify it to handle the GTF syntax for specifying gene names and ensure the names are carried over to target labels.

etal added the bug label Jan 30, 2018
@micknudsen
Contributor Author

The problem is in this line:

tag = 'Name='

Judging from the links in the Specs section in gff.py, there seems to be no universally agreed-upon standard for specifying gene names. For Ensembl it should be tag = 'gene_name='.

Maybe a solution would be to have Name as default but then allow the user to optionally specify a different tag name?
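
To illustrate the suggestion above, here is a minimal sketch of a tag lookup that defaults to Name but accepts a different tag. It is not the code in gff.py; the function names `parse_attributes` and `get_gene_label` are hypothetical, and the parser assumes that attribute values contain no semicolons. It handles both the GFF3 style (key=value) and the GTF style (key "value").

```python
import re


def parse_attributes(attr_field):
    """Parse column 9 of a GFF3 or GTF line into a dict.

    Hypothetical helper for illustration only. Assumes values do not
    contain ';'. If a key repeats, the first value is kept.
    """
    attrs = {}
    for part in attr_field.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        # GTF style looks like: key "value"; GFF3 style looks like: key=value
        if "=" in part and not re.match(r'^\S+\s+"', part):
            key, _, value = part.partition("=")
        else:
            key, _, value = part.partition(" ")
        attrs.setdefault(key.strip(), value.strip().strip('"'))
    return attrs


def get_gene_label(attr_field, tag="Name", fallback="-"):
    """Return the value of `tag`, or `fallback` if the tag is absent."""
    return parse_attributes(attr_field).get(tag, fallback)
```

With an attribute column that carries gene_name but no Name, the default lookup falls back to "-", which would match the unannotated output shown earlier, while passing tag="gene_name" returns the gene symbol.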

etal added the skgenome label Feb 8, 2018
@micknudsen
Contributor Author

I have come up with a possible solution in this branch. Not sure if it fits into the general idea of how your code is structured, but it works (for GFF files):

$ cnvkit.py target --annotate gencode.v27.annotation.gff3 test.bed
Detected file format: bed
Applying annotations as target names
Detected file format: gff
chr17	43124026	43124115	-

$ cnvkit.py target --annotate gencode.v27.annotation.gff3 --tag gene_name test.bed
Detected file format: bed
Applying annotations as target names
Detected file format: gff
chr17	43124026	43124115	BRCA1

@etal
Owner
etal commented Feb 15, 2018

Thanks for the example code. Rather than add the --tag option to each command that can accept input regions in BED/GFF/etc. format, I think a safer workflow is to use the skg_convert.py script to convert GFF2/GTF/GFF3 to BED, adding a --gff-tag option there to allow looking at nonstandard tags, or specifying one tag if multiple "standard" tags are present in the file. Then use the resulting BED file with target and other commands.

The command line for that script is:

skg_convert.py -f gff -t bed4 gencode.v27.annotation.gff3 -o gencode.v27.bed
cnvkit.py target --annotate gencode.v27.bed test.bed

With the latest commits to make GFF parsing more permissive, this should work on the gencode files. (It takes about a minute on my machine.) I have not added the --gff-tag option to skg_convert.py yet.
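
For readers unsure what the second step does with the converted file, the following is a rough sketch of labeling target intervals from a BED4 annotation by overlap. It is an assumption about the general idea, not CNVkit's actual implementation; `annotate_targets` is a hypothetical name, intervals are treated as 0-based half-open as in BED, and the "-" placeholder mirrors the unannotated output shown earlier in this thread.

```python
def annotate_targets(targets, annotations, default="-"):
    """Label each (chrom, start, end) target with overlapping annotation names.

    `annotations` is a list of (chrom, start, end, name) tuples, as would be
    read from a BED4 file. Hypothetical sketch; a real tool would use an
    indexed lookup rather than this quadratic scan.
    """
    labeled = []
    for chrom, start, end in targets:
        names = sorted({
            name
            for a_chrom, a_start, a_end, name in annotations
            if a_chrom == chrom and a_start < end and start < a_end
        })
        labeled.append((chrom, start, end, ",".join(names) if names else default))
    return labeled
```

A target that overlaps no annotated region keeps the "-" label, so a missing gene name in the BED file shows up the same way as a parsing failure did in the original report.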

etal added a commit that referenced this issue Feb 21, 2018

Options --gff-tag and --gff-type allow filtering a complex GFF for just the
relevant gene/exon annotations and help work around nonstandard tag usage.
Option --refflat-type is equivalent to the --exon feature from refFlat2bed.py,
and --flatten and --merge are copied directly from that script.

A small but important bugfix in skgenome.merge.

Some tweaks for code clarity.
@etal
Owner
etal commented Feb 21, 2018

I've added some features to skg_convert.py to help extract the right info from your GFF file. Use it like:

skg_convert.py gencode.v27.annotation.gff3 -f gff -t bed4 -o gencode.v27.bed --gff-type exon --gff-tag gene_name --merge

(Use -h for command line option descriptions, as always.)
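
As a conceptual illustration of the command above, here is a small sketch of filtering a GFF by feature type, reading one attribute tag, and merging overlapping intervals into BED4 rows. It is not the skg_convert.py source and may differ from what --merge actually does; the function name `gff_to_bed4` and the choice to merge only intervals that share a name are assumptions. GFF coordinates are 1-based inclusive, so the start is shifted by one for BED.

```python
import re


def gff_to_bed4(lines, feature_type="exon", tag="gene_name"):
    """Convert GFF3/GTF lines to merged (chrom, start, end, name) tuples.

    Hypothetical sketch: keeps rows whose type column equals `feature_type`,
    names them by `tag` ("-" if absent), and merges overlapping rows that
    share chromosome and name.
    """
    tag_re = re.compile(
        r'(?:^|;)\s*' + re.escape(tag) + r'(?:=|\s+)"?([^";]+)"?'
    )
    records = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature_type:
            continue
        match = tag_re.search(fields[8])
        name = match.group(1) if match else "-"
        # GFF is 1-based inclusive; BED is 0-based half-open.
        records.append((fields[0], int(fields[3]) - 1, int(fields[4]), name))

    # Sort so rows of the same chromosome and name are adjacent, then merge.
    records.sort(key=lambda r: (r[0], r[3], r[1], r[2]))
    merged = []
    for chrom, start, end, name in records:
        prev = merged[-1] if merged else None
        if prev and prev[0] == chrom and prev[3] == name and start <= prev[2]:
            merged[-1] = (chrom, prev[1], max(prev[2], end), name)
        else:
            merged.append((chrom, start, end, name))
    merged.sort(key=lambda r: (r[0], r[1], r[2]))
    return merged
```

Filtering on exon rows first matters because a GENCODE file also contains gene and transcript rows that span introns, which would otherwise label intronic targets.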
