Research Article
DOI: 10.1145/3107411.3107425

Understanding Sequence Conservation With Deep Learning

Published: 20 August 2017

Abstract

Comparative genomics has been a powerful tool for identifying functional elements in the human genome. Millions of conserved elements have been discovered. However, understanding the functional roles of these elements still remains a challenge, especially in noncoding regions. In particular, it is still unclear why these elements are evolutionarily conserved and what kinds of functional elements are encoded within these sequences. We present a deep learning framework, DeepCons, to further understand potential functional elements within conserved sequences. DeepCons is a convolutional neural network (CNN) that receives a short segment of DNA sequence as input and outputs the probability that the sequence is evolutionarily conserved. The CNN model utilizes hundreds of convolution kernels, which are analogous to sequence motifs, to extract features from DNA sequences during the training process. First, we train the model to discriminate 887,577 conserved elements from a matched number of nonconserved elements in the human genome. Then, we use visualization techniques to interpret how the model discriminates between the two classes of sequences, which provides indirect clues to the functional roles of conserved elements. Some kernels significantly match well-known regulatory motifs corresponding to transcription factors. Many kernels show positional biases relative to transcription start sites or transcription end sites. Most of the kernels do not correspond to any known functional element, suggesting that they might represent unknown categories of functional elements. We also utilize DeepCons to annotate how changes at individual nucleotides impact the conservation properties of the surrounding sequences, thereby providing an annotation of conserved sequences at an individual nucleotide level. The source code of DeepCons is publicly available at https://github.com/uci-cbcl/DeepCons.
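The abstract outlines two computational steps: a convolutional classifier over one-hot encoded DNA, and a nucleotide-level annotation derived from how single-base changes shift the predicted conservation probability. The sketch below illustrates both ideas in Python with Keras; it is a minimal illustration under stated assumptions, not the authors' published architecture. The input length (200 bp), kernel count (256), kernel width (12), dense-layer sizes, and the helper names one_hot_encode, build_deepcons_like, and mutagenesis_scores are all invented here for the example; the actual implementation is in the GitHub repository linked above.

```python
import numpy as np
from tensorflow.keras import layers, models

BASES = "ACGT"

def one_hot_encode(seq):
    """One-hot encode a DNA string into an (L, 4) float32 array (A, C, G, T)."""
    idx = {b: i for i, b in enumerate(BASES)}
    x = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in idx:                  # ambiguous bases (e.g. N) stay all-zero
            x[i, idx[base]] = 1.0
    return x

def build_deepcons_like(seq_len=200, n_kernels=256, kernel_width=12):
    """Binary classifier over DNA segments; each Conv1D kernel acts as a
    learned motif scanner, and global max pooling keeps its best match
    anywhere in the input (illustrative hyperparameters, not the paper's)."""
    model = models.Sequential([
        layers.Input(shape=(seq_len, 4)),
        layers.Conv1D(n_kernels, kernel_width, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),   # P(segment is conserved)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

def mutagenesis_scores(model, seq):
    """Nucleotide-level annotation by in silico mutagenesis: score each
    position by the largest drop in predicted conservation probability
    across the three possible single-base substitutions."""
    x = one_hot_encode(seq)[None, ...]               # shape (1, L, 4)
    ref = float(model.predict(x, verbose=0)[0, 0])
    scores = np.zeros(len(seq), dtype=np.float32)
    for i in range(len(seq)):
        deltas = [0.0]
        for b in range(4):
            if x[0, i, b] == 1.0:                    # skip the reference base
                continue
            mut = x.copy()
            mut[0, i, :] = 0.0
            mut[0, i, b] = 1.0
            deltas.append(ref - float(model.predict(mut, verbose=0)[0, 0]))
        scores[i] = max(deltas)                      # positive = mutation hurts
    return scores
```

Global max pooling makes each kernel report its strongest match anywhere in the segment, which is what makes learned kernels directly comparable to sequence motifs; the mutagenesis loop assumes len(seq) equals the seq_len the model was built with.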




    Published In

    ACM-BCB '17: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
    August 2017
    800 pages
    ISBN:9781450347228
    DOI:10.1145/3107411
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. conserved elements
    2. deep learning
    3. regulatory motifs


    Funding Sources

    • NIH

    Conference

    BCB '17

    Acceptance Rates

    ACM-BCB '17 paper acceptance rate: 42 of 132 submissions (32%)
    Overall acceptance rate: 254 of 885 submissions (29%)


    Cited By

    • (2020) "Effect of sequence padding on the performance of deep learning models in archaeal protein functional prediction." Scientific Reports 10:1. https://doi.org/10.1038/s41598-020-71450-8 (3 September 2020)
    • (2020) "Identification and characterization of constrained non-exonic bases lacking predictive epigenomic and transcription factor binding annotations." Nature Communications 11:1. https://doi.org/10.1038/s41467-020-19962-9 (2 December 2020)
    • (2020) "Identifying viruses from metagenomic data using deep learning." Quantitative Biology 8:1, 64-77. https://doi.org/10.1007/s40484-019-0187-4 (March 2020)
    • (2019) "Convolutional Classification of Pathogenicity in H5 Avian Influenza Strains." 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), 1570-1577. https://doi.org/10.1109/ICMLA.2019.00259 (December 2019)
    • (2018) "Deep Learning in Biomedical Data Science." Annual Review of Biomedical Data Science 1:1, 181-205. https://doi.org/10.1146/annurev-biodatasci-080917-013343 (20 July 2018)
