Empirical Study of Domain Adaptation Algorithms on the Task of Splice Site Prediction

Nic Herndon & Doina Caragea

Part of the book series: Communications in Computer and Information Science (CCIS, volume 511)

Included in the following conference series:

International Joint Conference on Biomedical Engineering Systems and Technologies

Abstract

Many biological problems that rely on machine learning do not have enough labeled data to train a classic classifier. To address this, we propose two domain adaptation algorithms, derived from the multinomial naïve Bayes classifier, that leverage the large corpus of labeled data from a similar, well-studied organism (the source domain), in conjunction with unlabeled data and some labeled data from an organism of interest (the target domain). When evaluated on splice site prediction, a difficult and essential step in gene prediction, they correctly classified instances with highest average area under the precision-recall curve (auPRC) values between 18.46% and 78.01%. We show that the algorithms learned meaningful patterns by evaluating them on shuffled instances and labels. We then used one of the algorithms in an ensemble setting, which produced even better results when little labeled data was available or the domains were distantly related.

Notes

1. Downloaded from ftp://ftp.tuebingen.mpg.de/fml/cwidmer/
2. WEKA Attribute-Relation File Format (ARFF) is described at http://www.cs.waikato.ac.nz/ml/weka/arff.html.

Acknowledgements

Supported in part by the Kansas INBRE, P20 GM103418. The computing for this project was performed on the Beocat Research Cluster at Kansas State University, which is funded in part by NSF grants CNS-1006860, EPS-1006860, EPS-0919443, and MRI-1126709.

Author information

Authors and Affiliations

Kansas State University, 234 Nichols Hall, Manhattan, KS, 66506, USA
Nic Herndon & Doina Caragea

Corresponding author

Correspondence to Nic Herndon.

Editor information

Editors and Affiliations

ESEO, ANGERS CEDEX 02, France
Guy Plantier
Cognitive Systems Lab., Karlsruhe Institute of Technology, Karlsruhe, Baden-Württemberg, Germany
Tanja Schultz
Technical University of Lisbon, Lisbon, Portugal
Ana Fred
New University of Lisbon, Lisboa, Portugal
Hugo Gamboa

About this paper

Cite this paper

Herndon, N., Caragea, D. (2015). Empirical Study of Domain Adaptation Algorithms on the Task of Splice Site Prediction. In: Plantier, G., Schultz, T., Fred, A., Gamboa, H. (eds) Biomedical Engineering Systems and Technologies. BIOSTEC 2014. Communications in Computer and Information Science, vol 511. Springer, Cham. https://doi.org/10.1007/978-3-319-26129-4_13

DOI: https://doi.org/10.1007/978-3-319-26129-4_13
Published: 07 January 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26128-7
Online ISBN: 978-3-319-26129-4
eBook Packages: Computer Science, Computer Science (R0)
