Abstract
Many biological problems that rely on machine learning do not have enough labeled data to train a classic classifier. To address this, we propose two domain adaptation algorithms, derived from the multinomial naïve Bayes classifier, that leverage the large corpus of labeled data from a similar, well-studied organism (the source domain), in conjunction with unlabeled data and some labeled data from an organism of interest (the target domain). When evaluated on splice site prediction, a difficult and essential step in gene prediction, the algorithms correctly classified instances with highest average area under the precision-recall curve (auPRC) values between 18.46% and 78.01%. By evaluating the algorithms on shuffled instances and labels, we show that they learned meaningful patterns. We then used one of the algorithms in an ensemble setting, which produced even better results when little labeled data was available or when the domains were distantly related.
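As a rough illustration of the general idea of combining source and target data in a multinomial naïve Bayes model, the sketch below blends class and feature counts from the two domains with a single weight. This is not the algorithms proposed in the chapter: it ignores unlabeled target data entirely, and the k-mer features, the `weight` parameter, the Laplace smoothing, and all function names are assumptions made only for this example.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers (assumed feature representation)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_features(examples, k=3):
    """Return per-class k-mer counts and per-class instance counts."""
    feat, prior = {}, Counter()
    for seq, label in examples:
        prior[label] += 1
        feat.setdefault(label, Counter()).update(kmers(seq, k))
    return feat, prior


def train_blended_nb(source, target, weight=0.5, k=3):
    """Fit a multinomial naive Bayes model on blended counts.

    Hypothetical blending rule: (1 - weight) * source + weight * target.
    """
    s_feat, s_prior = count_features(source, k)
    t_feat, t_prior = count_features(target, k)
    classes = sorted(set(s_prior) | set(t_prior))
    vocab = sorted({m for feat in (s_feat, t_feat)
                    for cnt in feat.values() for m in cnt})

    def blend(s, t):
        return (1.0 - weight) * s + weight * t

    # Add-one smoothing keeps every prior and likelihood strictly positive.
    prior_mass = {c: blend(s_prior[c], t_prior[c]) + 1.0 for c in classes}
    total_prior = sum(prior_mass.values())
    model = {"k": k, "classes": classes, "log_prior": {}, "log_like": {}}
    for c in classes:
        s_cnt = s_feat.get(c, Counter())
        t_cnt = t_feat.get(c, Counter())
        counts = {m: blend(s_cnt[m], t_cnt[m]) + 1.0 for m in vocab}
        total = sum(counts.values())
        model["log_prior"][c] = math.log(prior_mass[c] / total_prior)
        model["log_like"][c] = {m: math.log(v / total) for m, v in counts.items()}
    return model


def predict(model, seq):
    """Return the class with the highest log posterior; unseen k-mers are skipped."""
    scores = {}
    for c in model["classes"]:
        ll = model["log_like"][c]
        scores[c] = model["log_prior"][c] + sum(
            ll[m] for m in kmers(seq, model["k"]) if m in ll)
    return max(scores, key=scores.get)


# Toy data, invented for illustration only (label 1 = site, 0 = non-site).
source = [("AAGGTAAG", 1), ("CCTTCCTT", 0)]
target = [("AGGTAAGT", 1), ("TTCCTTCC", 0)]
model = train_blended_nb(source, target, weight=0.5)
label_site = predict(model, "GGTAAG")
label_other = predict(model, "CCTTCC")
```

Setting `weight` closer to 1 makes the model rely more on the target organism, which in this sketch would only make sense when enough labeled target data is available.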
Notes
1. Downloaded from ftp://ftp.tuebingen.mpg.de/fml/cwidmer/
2. WEKA Attribute-Relation File Format (ARFF) is described at http://www.cs.waikato.ac.nz/ml/weka/arff.html.
Acknowledgements
Supported in part by the Kansas INBRE, P20 GM103418. The computing for this project was performed on the Beocat Research Cluster at Kansas State University, which is funded in part by NSF grants CNS-1006860, EPS-1006860, EPS-0919443, and MRI-1126709.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Herndon, N., Caragea, D. (2015). Empirical Study of Domain Adaptation Algorithms on the Task of Splice Site Prediction. In: Plantier, G., Schultz, T., Fred, A., Gamboa, H. (eds) Biomedical Engineering Systems and Technologies. BIOSTEC 2014. Communications in Computer and Information Science, vol 511. Springer, Cham. https://doi.org/10.1007/978-3-319-26129-4_13
DOI: https://doi.org/10.1007/978-3-319-26129-4_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26128-7
Online ISBN: 978-3-319-26129-4
eBook Packages: Computer Science (R0)