
Cross-validation and cross-study validation of chronic lymphocytic leukaemia with exome sequences and machine learning

Published: 01 January 2016

Abstract

The era of genomics brings the potential of better DNA-based risk prediction and treatment. We explore this problem for chronic lymphocytic leukaemia, which has one of the largest whole-exome data sets available from the NIH dbGaP database. We apply a standard next-generation sequencing procedure to obtain single-nucleotide polymorphism (SNP) variants and reach a peak mean accuracy of 82% in our cross-validation study. We also cross-validate an Affymetrix 6.0 genome-wide association study of the same samples, where we find a peak accuracy of 57%. We then perform a cross-study validation in which exome samples from other studies in the NIH dbGaP database serve as the external data set. There we obtain an accuracy of 70% with the top Pearson-ranked SNPs obtained from the original exome data set. Our study shows that even with a small sample size we can obtain moderate to high accuracy with exome sequences, which is encouraging for future work.
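The two evaluation designs named in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline: the abstract does not name the classifier, so a nearest-centroid rule stands in for it, the genotype data are synthetic, and every function name here (`top_snps`, `cross_validate`, `cross_study`, `toy_study`) is hypothetical. The sketch re-ranks SNPs by absolute Pearson correlation on each training split only, which is one common way to keep held-out samples from influencing feature selection.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def top_snps(X, y, k):
    """Indices of the k SNP columns with the largest |Pearson r| against the labels."""
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]


def fit_centroids(X, y, cols):
    """Per-class mean genotype vector over the selected columns (stand-in classifier)."""
    cents = {}
    for label in set(y):
        rows = [[row[j] for j in cols] for row, lab in zip(X, y) if lab == label]
        cents[label] = [sum(c) / len(rows) for c in zip(*rows)]
    return cents


def predict(cents, row, cols):
    """Label of the centroid closest (squared Euclidean distance) to the sample."""
    v = [row[j] for j in cols]
    return min(cents, key=lambda lab: sum((a - b) ** 2 for a, b in zip(v, cents[lab])))


def accuracy(cents, X, y, cols):
    return sum(predict(cents, r, cols) == lab for r, lab in zip(X, y)) / len(y)


def cross_validate(X, y, k_snps, folds=5, seed=0):
    """Mean accuracy over folds; SNPs are re-ranked on each training split only."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    accs = []
    for f in range(folds):
        test = idx[f::folds]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        cols = top_snps(Xtr, ytr, k_snps)
        cents = fit_centroids(Xtr, ytr, cols)
        accs.append(accuracy(cents, [X[i] for i in test], [y[i] for i in test], cols))
    return sum(accs) / len(accs)


def cross_study(X_src, y_src, X_ext, y_ext, k_snps):
    """Rank SNPs and train on the source study, then score on the external study."""
    cols = top_snps(X_src, y_src, k_snps)
    cents = fit_centroids(X_src, y_src, cols)
    return accuracy(cents, X_ext, y_ext, cols)


def toy_study(n, seed):
    """Synthetic genotypes coded 0/1/2: SNP 0 tracks case status, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([2 * label] + [rng.randint(0, 2) for _ in range(9)])
        y.append(label)
    return X, y


X, y = toy_study(40, seed=1)              # stands in for the original exome study
X_ext, y_ext = toy_study(20, seed=2)      # stands in for the external studies
cv_acc = cross_validate(X, y, k_snps=1)
ext_acc = cross_study(X, y, X_ext, y_ext, k_snps=1)
```

On this toy data SNP 0 is built to determine the label, so both estimates are perfect; real exome data would of course behave differently, and the gap between `cv_acc` and `ext_acc` is the quantity a cross-study validation is meant to expose.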

Cited By

  • (2018) Artificial neural network classification of microarray data using new hybrid gene selection method. International Journal of Data Mining and Bioinformatics, 17:1 (42-65). DOI: 10.1504/IJDMB.2017.084026. Online publication date: 23-Dec-2018
  • (2018) Cross-validation and cross-study validation of kidney cancer with machine learning and whole exome sequences from the National Cancer Institute. 2018 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB) (1-6). DOI: 10.1109/CIBCB.2018.8404967. Online publication date: 30-May-2018

      Published In

      International Journal of Data Mining and Bioinformatics  Volume 16, Issue 1
      January 2016
      91 pages
      ISSN:1748-5673
      EISSN:1748-5681

      Publisher

      Inderscience Publishers

      Geneva 15, Switzerland

      Author Tags

      1. SNP variants
      2. SNPs
      3. bioinformatics
      4. chronic lymphocytic leukaemia
      5. cross-study validation
      6. cross-validation
      7. disease risk prediction
      8. exome sequences
      9. exome wide association studies
      10. machine learning
      11. next-generation sequencing
      12. single nucleotide polymorphisms

      Qualifiers

      • Article
