
An Empirical Analysis of Under-Sampling Techniques to Balance a Protein Structural Class Dataset

  • Conference paper
Neural Information Processing (ICONIP 2006)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4234)

Abstract

There has been a great deal of research on learning from imbalanced datasets. Among the methods proposed to address this problem, the most common are based on either under-sampling or over-sampling the original dataset. In this work, we evaluate several under-sampling methods, such as Tomek Links, with the goal of improving the performance of classifiers generated by different ML algorithms (decision trees and support vector machines, among others) applied to the problem of determining the structural similarity of proteins.
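As an illustration of the kind of under-sampling the abstract refers to, the following is a minimal sketch of Tomek Links in plain Python. It is not the authors' implementation. A Tomek link is a pair of examples from different classes that are each other's nearest neighbour; the sketch assumes Euclidean distance and the variant that discards only the majority-class member of each link. The function names, the toy points, and the labels `"maj"` and `"min"` are hypothetical and chosen only for this example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_neighbor(points, i):
    """Return the index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: dist(points[i], points[j]))


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different class labels."""
    nn = [nearest_neighbor(points, i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


def undersample_majority(points, labels, majority_label):
    """Drop the majority-class member of every Tomek link (assumed variant)."""
    drop = {k
            for pair in tomek_links(points, labels)
            for k in pair
            if labels[k] == majority_label}
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Hypothetical toy data: one majority point sits right next to a minority point.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0), (5.2, 5.0)]
labels = ["maj", "maj", "maj", "min", "min", "min"]

links = tomek_links(points, labels)
kept_points, kept_labels = undersample_majority(points, labels, "maj")
```

In this toy set only the pair at indices 2 and 3 forms a Tomek link, so the majority point `(1.0, 1.0)` is removed and every minority example is kept. The brute-force neighbour search is quadratic in the number of examples, which is acceptable for a sketch but would need an index structure for a large dataset.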

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

de Souto, M.C.P., Bittencourt, V.G., Costa, J.A.F. (2006). An Empirical Analysis of Under-Sampling Techniques to Balance a Protein Structural Class Dataset. In: King, I., Wang, J., Chan, LW., Wang, D. (eds) Neural Information Processing. ICONIP 2006. Lecture Notes in Computer Science, vol 4234. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11893295_3

  • DOI: https://doi.org/10.1007/11893295_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-46484-6

  • Online ISBN: 978-3-540-46485-3

  • eBook Packages: Computer Science, Computer Science (R0)
