Target annotation using GTF/GFF3 #311

micknudsen · 2018-01-30T12:29:50Z

So far I have been using a refFlat file as --annotate input in the target command. However, I am now planning to start using GENCODE, and the only available files are GTF and GFF3.

Example: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_27/gencode.v27.annotation.gtf.gz

According to the target documentation, one should be able to use GTF as input. However, it doesn't seem to work. For example, here is a simple BED file containing only one BRCA1 exon (with exact coordinates as given in the GTF file):

$ cat test.bed
chr17	43124026	43124115

But target does not annotate the interval:

$ cnvkit.py target --annotate gencode.v27.annotation.gtf test.bed
Detected file format: bed
Applying annotations as target names
Detected file format: gff
chr17	43124026	43124115	-

Note that it auto-detects GFF format even though the input is GTF. Output is the same when the GFF3 is used as input.

etal · 2018-01-30T17:33:43Z

Thanks for the feedback. GFF/GTF/GFF3 support is pretty simplistic, all handled by the same parser, and relatively untested. On my GFF3 test files the parser picks up gene names correctly, but I'll need to modify it to handle the GTF syntax for specifying gene names and ensure the names are carried over to target labels.

micknudsen · 2018-02-06T09:30:45Z

The problem is in this line:

cnvkit/skgenome/tabio/gff.py

Line 47 in e16caae

tag = 'Name='

Judging from the links in the Specs section in gff.py, there seems to be no universally agreed-upon standard for specifying gene names. For Ensembl it should be tag = 'gene_name='.

Maybe a solution would be to have Name as default but then allow the user to optionally specify a different tag name?
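The idea of a default tag with an optional override can be sketched as follows. This is only an illustration, not the code in gff.py: the function name `get_attribute` and the sample attribute strings are hypothetical, and the parsing assumes that attribute values contain no semicolons.

```python
import re

# GFF3 writes attributes as key=value; GTF writes them as key "value".
GFF3_PAIR = re.compile(r'^(\w+)=(.*)$')
GTF_PAIR = re.compile(r'^(\S+)\s+(.*)$')


def get_attribute(attr_field, tag="Name"):
    """Return the value of `tag` from a GFF3 or GTF attribute column.

    `tag` defaults to "Name" but can be overridden, e.g. with "gene_name".
    Returns None if the tag is not present.
    Assumption: values do not themselves contain ';'.
    """
    for part in attr_field.split(";"):
        part = part.strip()
        if not part:
            continue
        # Try the GFF3 form first, then fall back to the GTF form.
        match = GFF3_PAIR.match(part) or GTF_PAIR.match(part)
        if match is None:
            continue
        key, value = match.groups()
        if key == tag:
            # GTF values are usually quoted; drop the quotes.
            return value.strip().strip('"')
    return None


# Illustrative attribute strings (made up for this example).
gff3_attrs = "ID=exon:1;gene_type=protein_coding;gene_name=BRCA1"
gtf_attrs = 'gene_id "G1"; gene_type "protein_coding"; gene_name "BRCA1";'

print(get_attribute(gff3_attrs))               # default tag "Name" is absent
print(get_attribute(gff3_attrs, "gene_name"))
print(get_attribute(gtf_attrs, "gene_name"))
```

With the default tag, a record that carries only gene_name yields no name, which would explain the "-" label in the output above; passing the tag explicitly recovers it for both syntaxes.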

micknudsen · 2018-02-11T16:46:17Z

I have come up with a possible solution in this branch. Not sure if it fits into the general idea of how your code is structured, but it works (for GFF files):

$ cnvkit.py target --annotate gencode.v27.annotation.gff3 test.bed
Detected file format: bed
Applying annotations as target names
Detected file format: gff
chr17	43124026	43124115	-

$ cnvkit.py target --annotate gencode.v27.annotation.gff3 --tag gene_name test.bed
Detected file format: bed
Applying annotations as target names
Detected file format: gff
chr17	43124026	43124115	BRCA1

etal · 2018-02-15T02:36:09Z

Thanks for the example code. Rather than add the --tag option to each command that can accept input regions in BED/GFF/etc. format, I think a safer workflow is to use the skg_convert.py script to convert GFF2/GTF/GFF3 to BED, adding a --gff-tag option there to allow looking at nonstandard tags, or specifying one tag if multiple "standard" tags are present in the file. Then use the resulting BED file with target and other commands.

The command line for that script is:

skg_convert.py -f gff -t bed4 gencode.v27.annotation.gff3 -o gencode.v27.bed
cnvkit.py target --annotate gencode.v27.bed test.bed

With the latest commits to make GFF parsing more permissive, this should work on the gencode files. (It takes about a minute on my machine.) I have not added the --gff-tag option to skg_convert.py yet.

Options --gff-tag and --gff-type allow filtering a complex GFF for just the relevant gene/exon annotations and help work around nonstandard tag usage. Option --refflat-type is equivalent to the --exon feature from refFlat2bed.py, and --flatten and --merge are copied directly from that script. A small but important bugfix in skgenome.merge. Some tweaks for code clarity.

etal · 2018-02-21T22:34:55Z

I've added some features to skg_convert.py to help extract the right info from your GFF file. Use it like:

skg_convert.py gencode.v27.annotation.gff3 -f gff -t bed4 -o gencode.v27.bed --gff-type exon --gff-tag gene_name --merge

(Use -h for command line option descriptions, as always.)
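For readers who want to see roughly what such a conversion involves, here is a minimal sketch of turning one GFF/GTF record into a BED4 row. It is not how skg_convert.py is implemented; the function name `gff_exon_to_bed4` and the sample record are hypothetical. It filters on the feature type, looks up one attribute tag, and shifts the start coordinate, since GFF/GTF intervals are 1-based and inclusive while BED intervals are 0-based and half-open.

```python
import re


def gff_exon_to_bed4(line, feature_type="exon", tag="gene_name"):
    """Convert one GFF3/GTF line to a (chrom, start, end, name) tuple.

    Returns None for comments, malformed lines, and other feature types.
    Records that lack `tag` get "-" as their name.
    """
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return None
    chrom, _source, ftype, start, end, _score, _strand, _phase, attrs = fields
    if ftype != feature_type:
        return None
    # Accept both GFF3 (tag=value) and GTF (tag "value") attribute syntax.
    pattern = r'(?:^|;)\s*' + re.escape(tag) + r'(?:=|\s+)"?([^";]+)"?'
    match = re.search(pattern, attrs)
    name = match.group(1) if match else "-"
    # 1-based inclusive -> 0-based half-open: only the start changes.
    return (chrom, int(start) - 1, int(end), name)


# Made-up record for illustration only.
record = "chr1\tsrc\texon\t101\t200\t.\t+\t.\tID=exon:1;gene_name=GENE1"
print(gff_exon_to_bed4(record))
```

Merging overlapping exons, which the --merge option above takes care of, is deliberately left out of this sketch.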

etal added the bug label Jan 30, 2018

etal added the skgenome label Feb 8, 2018

etal added a commit that referenced this issue Feb 15, 2018

tabio/gff: Handle nonstandard GFF attributes more safely (#311)

699cb4d

Thanks @micknudsen!

etal added a commit that referenced this issue Feb 15, 2018

skg_convert.py: Add stub options for GFF parsing (#311)

35f6313

etal closed this as completed Feb 21, 2018

etal added a commit that referenced this issue Feb 23, 2018

doc: Suggest preprocessing the input regions for guess_baits.py (#311)

b9b6179
