Research Article
DOI: 10.1145/3107411.3107425

Understanding Sequence Conservation With Deep Learning

Published: 20 August 2017

Abstract

Comparative genomics has been a powerful tool for identifying functional elements in the human genome. Millions of conserved elements have been discovered. However, understanding the functional roles of these elements still remains a challenge, especially in noncoding regions. In particular, it is still unclear why these elements are evolutionarily conserved and what kinds of functional elements are encoded within these sequences. We present a deep learning framework, DeepCons, to further understand potential functional elements within conserved sequences. DeepCons is a convolutional neural network (CNN) that receives a short segment of DNA sequence as input and outputs the probability that the sequence is evolutionarily conserved. The CNN model utilizes hundreds of convolution kernels, which are analogous to sequence motifs, to extract features from DNA sequences during the training process. First, we train the model to discriminate 887,577 conserved elements from a matched number of nonconserved elements in the human genome. Then, we use visualization techniques to interpret how the model discriminates between the two classes of sequences, which provides indirect clues to the functional roles of conserved elements. Some kernels significantly match well-known regulatory motifs corresponding to transcription factors. Many kernels show positional biases relative to transcription start sites or transcription end sites. Most of the kernels do not correspond to any known functional element, suggesting that they might represent unknown categories of functional elements. We also utilize DeepCons to annotate how changes at individual nucleotides impact the conservation properties of the surrounding sequences, thereby providing an annotation of conserved sequences at an individual nucleotide level. The source code of DeepCons is publicly available at https://github.com/uci-cbcl/DeepCons.
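The abstract outlines two computational steps: a convolutional classifier over one-hot encoded DNA, and a nucleotide-level annotation derived from how single-base changes shift the predicted conservation probability. The sketch below illustrates both ideas in Python with Keras; it is a minimal illustration under stated assumptions, not the authors' published architecture. The input length (200 bp), kernel count (256), kernel width (12), dense-layer sizes, and the helper names one_hot_encode, build_deepcons_like, and mutagenesis_scores are all invented here for the example; the actual implementation is in the GitHub repository linked above.

```python
import numpy as np
from tensorflow.keras import layers, models

BASES = "ACGT"

def one_hot_encode(seq):
    """One-hot encode a DNA string into an (L, 4) float32 array (A, C, G, T)."""
    idx = {b: i for i, b in enumerate(BASES)}
    x = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in idx:                  # ambiguous bases (e.g. N) stay all-zero
            x[i, idx[base]] = 1.0
    return x

def build_deepcons_like(seq_len=200, n_kernels=256, kernel_width=12):
    """Binary classifier over DNA segments; each Conv1D kernel acts as a
    learned motif scanner, and global max pooling keeps its best match
    anywhere in the input (illustrative hyperparameters, not the paper's)."""
    model = models.Sequential([
        layers.Input(shape=(seq_len, 4)),
        layers.Conv1D(n_kernels, kernel_width, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),   # P(segment is conserved)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

def mutagenesis_scores(model, seq):
    """Nucleotide-level annotation by in silico mutagenesis: score each
    position by the largest drop in predicted conservation probability
    across the three possible single-base substitutions."""
    x = one_hot_encode(seq)[None, ...]               # shape (1, L, 4)
    ref = float(model.predict(x, verbose=0)[0, 0])
    scores = np.zeros(len(seq), dtype=np.float32)
    for i in range(len(seq)):
        deltas = [0.0]
        for b in range(4):
            if x[0, i, b] == 1.0:                    # skip the reference base
                continue
            mut = x.copy()
            mut[0, i, :] = 0.0
            mut[0, i, b] = 1.0
            deltas.append(ref - float(model.predict(mut, verbose=0)[0, 0]))
        scores[i] = max(deltas)                      # positive = mutation hurts
    return scores
```

Global max pooling makes each kernel report its strongest match anywhere in the segment, which is what makes learned kernels directly comparable to sequence motifs; the mutagenesis loop assumes len(seq) equals the seq_len the model was built with.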




    Published In

    ACM-BCB '17: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
    August 2017
    800 pages
    ISBN:9781450347228
    DOI:10.1145/3107411
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. conserved elements
    2. deep learning
    3. regulatory motifs


    Funding Sources

    • NIH

    Conference

    BCB '17

    Acceptance Rates

    ACM-BCB '17 paper acceptance rate: 42 of 132 submissions (32%)
    Overall acceptance rate: 254 of 885 submissions (29%)


    Cited By

    • (2020) "Effect of sequence padding on the performance of deep learning models in archaeal protein functional prediction." Scientific Reports 10:1. https://doi.org/10.1038/s41598-020-71450-8 (3 September 2020)
    • (2020) "Identification and characterization of constrained non-exonic bases lacking predictive epigenomic and transcription factor binding annotations." Nature Communications 11:1. https://doi.org/10.1038/s41467-020-19962-9 (2 December 2020)
    • (2020) "Identifying viruses from metagenomic data using deep learning." Quantitative Biology 8:1, 64-77. https://doi.org/10.1007/s40484-019-0187-4 (March 2020)
    • (2019) "Convolutional Classification of Pathogenicity in H5 Avian Influenza Strains." 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), 1570-1577. https://doi.org/10.1109/ICMLA.2019.00259 (December 2019)
    • (2018) "Deep Learning in Biomedical Data Science." Annual Review of Biomedical Data Science 1:1, 181-205. https://doi.org/10.1146/annurev-biodatasci-080917-013343 (20 July 2018)
