Empirical Study of Domain Adaptation Algorithms on the Task of Splice Site Prediction

  • Conference paper
Biomedical Engineering Systems and Technologies (BIOSTEC 2014)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 511)


Abstract

Many biological problems that rely on machine learning lack enough labeled data to train a classic classifier. To address this, we propose two domain adaptation algorithms, derived from the multinomial naïve Bayes classifier, that leverage the large corpus of labeled data from a similar, well-studied organism (the source domain) together with unlabeled data and a small amount of labeled data from the organism of interest (the target domain). When evaluated on splice site prediction, a difficult and essential step in gene prediction, they correctly classified instances with highest average area under the precision-recall curve (auPRC) values between 18.46% and 78.01%. We show that the algorithms learned meaningful patterns by evaluating them on shuffled instances and labels. We then used one of the algorithms in an ensemble setting and obtained even better results when little labeled data was available or the domains were distantly related.
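The abstract does not spell out the two algorithms, so the following is only a rough illustration of the general idea of combining source and target data in a multinomial naïve Bayes model, not the authors' method. It assumes sequences are represented by overlapping k-mer counts and that source and target counts are blended with a fixed weight; the function names, the weight `w`, the choice `k=2`, and the toy sequences are all hypothetical.

```python
import math
from collections import Counter

def kmers(seq, k=2):
    """Split a sequence into overlapping k-mers (the 'words' of the model)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def count_features(examples, k=2):
    """Return per-class k-mer counts and per-class instance counts."""
    feat, prior = {}, Counter()
    for seq, label in examples:
        feat.setdefault(label, Counter()).update(kmers(seq, k))
        prior[label] += 1
    return feat, prior

def train(source, target, w=0.5, k=2):
    """Blend labeled source and target counts.

    w is an assumed weight on the target domain (0 = source only,
    1 = target only). Laplace smoothing is applied to priors and likelihoods.
    """
    s_feat, s_prior = count_features(source, k)
    t_feat, t_prior = count_features(target, k)
    classes = sorted(set(s_prior) | set(t_prior))
    vocab = sorted({f for d in (s_feat, t_feat) for c in d.values() for f in c})
    weights = {c: (1 - w) * s_prior[c] + w * t_prior[c] for c in classes}
    z = sum(weights.values())
    model = {}
    for c in classes:
        blended = {f: (1 - w) * s_feat.get(c, {}).get(f, 0)
                      + w * t_feat.get(c, {}).get(f, 0) for f in vocab}
        total = sum(blended.values())
        log_lik = {f: math.log((blended[f] + 1) / (total + len(vocab)))
                   for f in vocab}
        log_prior = math.log((weights[c] + 1) / (z + len(classes)))
        model[c] = (log_prior, log_lik)
    return model

def predict(model, seq, k=2):
    """Return the class with the highest log posterior; unseen k-mers are skipped."""
    scores = {c: lp + sum(ll[f] for f in kmers(seq, k) if f in ll)
              for c, (lp, ll) in model.items()}
    return max(scores, key=scores.get)

# Toy data (made up): label 1 = splice site, label 0 = decoy.
source = [("AAGGTAAG", 1), ("CAGGTGAG", 1), ("ACCTTCCA", 0), ("TTCCACCT", 0)]
target = [("GAGGTAAG", 1), ("CTCCTTCA", 0)]
model = train(source, target, w=0.7)
print(predict(model, "AAGGTGAG"), predict(model, "TTCCTTCC"))
```

In a sketch like this, a larger `w` lets the scarce target examples dominate the estimates, while a smaller `w` leans on the abundant source data. Using the unlabeled target data, as the abstract describes, would need an additional step that is not shown here.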

Notes

  1. Downloaded from ftp://ftp.tuebingen.mpg.de/fml/cwidmer/

  2. WEKA Attribute-Relation File Format (ARFF) is described at http://www.cs.waikato.ac.nz/ml/weka/arff.html.

Acknowledgements

Supported in part by the Kansas INBRE, P20 GM103418. The computing for this project was performed on the Beocat Research Cluster at Kansas State University, which is funded in part by NSF grants CNS-1006860, EPS-1006860, EPS-0919443, and MRI-1126709.

Corresponding author

Correspondence to Nic Herndon.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Herndon, N., Caragea, D. (2015). Empirical Study of Domain Adaptation Algorithms on the Task of Splice Site Prediction. In: Plantier, G., Schultz, T., Fred, A., Gamboa, H. (eds) Biomedical Engineering Systems and Technologies. BIOSTEC 2014. Communications in Computer and Information Science, vol 511. Springer, Cham. https://doi.org/10.1007/978-3-319-26129-4_13

  • DOI: https://doi.org/10.1007/978-3-319-26129-4_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-26128-7

  • Online ISBN: 978-3-319-26129-4

  • eBook Packages: Computer Science, Computer Science (R0)
