Key Points
- The field of machine learning includes the development and application of computer algorithms that improve with experience.
- Machine learning methods can be divided into supervised, semi-supervised and unsupervised methods. Supervised methods are trained on examples with labels (for example, 'gene' or 'not gene') and are then used to predict these labels on other examples, whereas unsupervised methods find patterns in data sets without the use of labels. Semi-supervised methods combine these two approaches, leveraging patterns in unlabelled data to improve power in the prediction of labels (see the first sketch after this list).
- Different machine learning methods may be required for an application, depending on whether one is interested in interpreting the resulting model or is concerned only with predictive power. Generative models, which posit a probability distribution over the input data, are generally best for interpretability, whereas discriminative models, which seek only to model the labels, are generally best for predictive power (second sketch below).
- Prior information can be added to a model to train it more effectively when data are limited, to limit the complexity of the model or to incorporate data that the model does not use directly. Prior information can be incorporated explicitly in a probabilistic model or implicitly through the choice of features or similarity measures (third sketch below).
- The choice of an appropriate performance measure depends strongly on the application task. Machine learning methods are most effective when they optimize an appropriate performance measure (fourth sketch below).
- Network estimation methods are appropriate when the data contain complex dependencies among examples. These methods work best when they take into account the confounding effects of indirect relationships (fifth sketch below).
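The supervised/unsupervised/semi-supervised distinction is easiest to see in code. Below is a minimal sketch using scikit-learn on synthetic data; the data set, the choices of logistic regression, k-means and self-training, and all parameter values are illustrative assumptions rather than methods prescribed by the article.

```python
# Contrast supervised, unsupervised and semi-supervised learning on
# synthetic data; everything here is illustrative, not from the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Supervised: train on labelled examples, predict labels for held-out examples.
clf = LogisticRegression(max_iter=1000).fit(X[:200], y[:200])
print("supervised accuracy:", clf.score(X[200:], y[200:]))

# Unsupervised: find structure (here, two clusters) without using any labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))

# Semi-supervised: mark most labels as unknown (-1, scikit-learn's convention
# for 'unlabelled') and let self-training exploit the unlabelled examples.
y_partial = y.copy()
y_partial[50:200] = -1  # only the first 50 training examples keep labels
semi = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
semi.fit(X[:200], y_partial[:200])
print("semi-supervised accuracy:", semi.score(X[200:], y[200:]))
```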
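The generative/discriminative trade-off can likewise be sketched with two standard classifiers: Gaussian naive Bayes fits a full distribution over the features per class (and so can be inspected), whereas logistic regression models only the label given the features. The synthetic data and both model choices are illustrative assumptions.

```python
# Generative (Gaussian naive Bayes) versus discriminative (logistic
# regression) modelling on synthetic data; choices are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Generative: models P(features | label); the fitted per-class feature
# means are directly interpretable.
gen = GaussianNB().fit(X_tr, y_tr)
print("per-class feature means:\n", gen.theta_)

# Discriminative: models only P(label | features), which often predicts
# better when the generative assumptions do not hold.
disc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("generative accuracy:    ", gen.score(X_te, y_te))
print("discriminative accuracy:", disc.score(X_te, y_te))
```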
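One classic, explicit way to add prior information to a probabilistic model in genomics is Dirichlet-style pseudocounts when estimating a position weight matrix from only a few binding sites. The sites and the pseudocount value below are invented for illustration.

```python
# Encode prior information as pseudocounts when estimating a position
# weight matrix (PWM) from few examples; sites and values are illustrative.
import numpy as np

sites = ["TATAAT", "TATACT", "TACAAT", "TATGAT"]  # a handful of labelled sites
alphabet = "ACGT"
pseudocount = 0.5  # prior belief: no base has probability exactly zero

counts = np.full((len(sites[0]), len(alphabet)), pseudocount)
for site in sites:
    for pos, base in enumerate(site):
        counts[pos, alphabet.index(base)] += 1

# With the prior, bases unseen in the training sites receive small but
# non-zero probabilities, so new sequences are never assigned probability 0.
pwm = counts / counts.sum(axis=1, keepdims=True)
print(np.round(pwm, 3))
```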
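The importance of the performance measure shows up starkly under label skew: with 1% positive examples, a trivial "always negative" predictor scores 99% accuracy while finding nothing. A minimal sketch, with an illustrative synthetic data set and classifier:

```python
# Why the performance measure matters under label skew; the data set and
# classifier are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, average_precision_score

X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.99], random_state=0)  # ~1% positives
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

print("accuracy, 'always negative':", accuracy_score(y_te, np.zeros_like(y_te)))
print("accuracy, classifier:       ", accuracy_score(y_te, clf.predict(X_te)))
# Area under the precision-recall curve exposes the difference that raw
# accuracy hides when the labels are skewed.
print("average precision, classifier:", average_precision_score(y_te, scores))
```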
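Finally, the confounding effect of indirect relationships in network estimation can be shown in a three-variable simulation: B and C are both driven by A, so they correlate strongly, yet partial correlations (from the inverse covariance matrix) reveal no direct B-C edge. The simulation is an illustrative assumption; for larger networks a sparse estimate of the same precision matrix (for example, graphical lasso) is the usual choice.

```python
# Marginal versus partial correlation: an indirect (transitive) B-C
# relationship vanishes once A is conditioned on. Parameters illustrative.
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=2000)
b = a + 0.5 * rng.normal(size=2000)   # direct A -> B
c = a + 0.5 * rng.normal(size=2000)   # direct A -> C; no direct B -> C
X = np.column_stack([a, b, c])

print("marginal correlations:\n", np.round(np.corrcoef(X.T), 2))

# Partial correlations from the precision (inverse covariance) matrix:
# an entry near zero means no direct dependency given the other variables.
prec = np.linalg.inv(np.cov(X.T))
d = np.sqrt(np.diag(prec))
partial = -prec / np.outer(d, d)
np.fill_diagonal(partial, 1.0)
print("partial correlations:\n", np.round(partial, 2))
```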
Abstract
The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. Here, we provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. We present considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. We provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Glossary
- Machine learning: A field concerned with the development and application of computer algorithms that improve with experience.
- Artificial intelligence: A field concerned with the development of computer algorithms that replicate human skills, including learning, visual perception and natural language understanding.
- Heterogeneous data sets: A collection of data sets from multiple sources or experimental methodologies. Artefactual differences between data sets can confound analysis.
- Likelihood: The probability of a data set given a particular model.
- Label: The target of a prediction task. In classification, the label is discrete (for example, 'expressed' or 'not expressed'); in regression, the label is real-valued (for example, a gene expression value).
- Examples: Data instances used in a machine learning task.
- Supervised learning: Machine learning based on an algorithm that is trained on labelled examples and used to predict the labels of unlabelled examples.
- Unsupervised learning: Machine learning based on an algorithm that does not require labels, such as a clustering algorithm.
- Semi-supervised learning: A machine learning method that requires labels but also makes use of unlabelled examples.
- Prediction accuracy: The fraction of predictions that are correct; it is calculated by dividing the number of correct predictions by the total number of predictions (written as a formula after this glossary).
- Generative models: Machine learning models that build a full model of the distribution of features.
- Discriminative models: Machine learning approaches that model only the distribution of a label given the features.
- Features: Single measurements or descriptors of examples used in a machine learning task.
- Probabilistic framework: A machine learning approach based on a probability distribution over the labels and features.
- Missing data: An experimental condition in which some features are available for some, but not all, examples.
- Feature selection: The process of choosing a smaller set of features from a larger set, either before applying a machine learning method or as part of training.
- Input space: The set of features chosen to be used as input for a machine learning method.
- Uniform prior: A prior distribution for a Bayesian model that assigns equal probabilities to all models.
- Dirichlet mixture priors: Prior distributions for a Bayesian model over the relative frequencies of, for example, amino acids.
- Kernel methods: A class of machine learning methods (for example, the support vector machine) that use a type of similarity measure (called a kernel) between feature vectors (see the spectrum kernel sketch after this glossary).
- Bayesian network: A representation of a probability distribution that specifies the structure of dependencies between variables as a network.
- Curse of dimensionality: The observation that analysis can become more difficult as the number of features increases, particularly because overfitting becomes more likely.
- Overfitting: A common pitfall in machine learning analysis that occurs when a complex model is trained on too few data points and becomes specific to the training data, resulting in poor performance on other data.
- Label skew: A phenomenon in which the two labels in a supervised learning problem are present at different frequencies.
- Sensitivity: (Also known as recall.) The fraction of positive examples identified; it is given by the number of positive predictions that are correct divided by the total number of positive examples.
- Precision: The fraction of positive predictions that are correct; it is given by the number of positive predictions that are correct divided by the total number of positive predictions.
- Precision-recall curve: For a binary classifier applied to a given data set, a curve that plots precision (y axis) against recall (x axis) across a range of classification thresholds.
- Marginalization: A method for handling missing data points by summing over all possibilities for that random variable in the model (written as a formula after this glossary).
- Transitive relationships: An observed correlation between two features that is caused by direct relationships between each of these two features and a third feature.
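Several of the definitions above describe quantities verbally; for reference, here are the standard formulas behind them, where TP, FP, TN and FN denote the counts of true positives, false positives, true negatives and false negatives, and where $x$ is an observed variable and $y$ a missing one.

```latex
\begin{align}
\text{accuracy}              &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{sensitivity (recall)}  &= \frac{TP}{TP + FN} \\
\text{precision}             &= \frac{TP}{TP + FP} \\
P(x) &= \sum_{y} P(x, y) \quad \text{(marginalization over a missing variable } y\text{)}
\end{align}
```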
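To make the kernel methods entry concrete, here is a minimal sketch of a k-mer "spectrum" kernel on DNA sequences, used as a precomputed kernel in a support vector machine. The sequences, labels and choice of k are invented for illustration.

```python
# A simple spectrum kernel: similarity between two DNA sequences is the
# inner product of their k-mer count vectors. All inputs are illustrative.
import itertools
import numpy as np
from sklearn.svm import SVC

def spectrum_features(seq, k=3):
    """Vector of k-mer counts over the DNA alphabet."""
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    return np.array([sum(seq[i:i + k] == km for i in range(len(seq) - k + 1))
                     for km in kmers], dtype=float)

seqs = ["ACGTACGTAC", "ACGTTTGTAC", "GGGCCCGGGC", "GGCCCCGGCC"]
labels = [1, 1, 0, 0]

F = np.array([spectrum_features(s) for s in seqs])
K = F @ F.T  # Gram matrix: kernel values between all pairs of sequences

# An SVM never needs the feature vectors themselves, only the kernel.
svm = SVC(kernel="precomputed").fit(K, labels)
print(svm.predict(K))  # kernel between the training examples and themselves
```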
Cite this article
Libbrecht, M., Noble, W. Machine learning applications in genetics and genomics. Nat Rev Genet 16, 321–332 (2015). https://doi.org/10.1038/nrg3920