Abstract
There has been a great deal of research on learning from imbalanced datasets. Among the methods proposed to address this problem, the most common are based on either under-sampling or over-sampling the original dataset. In this work, we evaluate several under-sampling methods, such as Tomek Links, with the goal of improving the performance of the classifiers generated by different ML algorithms (decision trees and support vector machines, among others) applied to the problem of determining the structural similarity of proteins.
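As an illustration of the kind of under-sampling the abstract mentions, the following is a minimal sketch of Tomek Links on toy data. A Tomek link is a pair of examples from different classes that are each other's nearest neighbor; under-sampling removes the majority-class member of each such pair. The function names, the use of Euclidean distance, and the toy points are assumptions made for this example and are not taken from the paper, whose protein dataset and experimental setup are not reproduced here.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_neighbor(points, i):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: dist(points[i], points[j]),
    )


def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbors
    and carry different class labels."""
    nn = [nearest_neighbor(points, i) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


def undersample_tomek(points, labels, majority_label):
    """Drop the majority-class member of every Tomek link."""
    drop = {
        k
        for pair in tomek_links(points, labels)
        for k in pair
        if labels[k] == majority_label
    }
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Hypothetical toy data: three majority and three minority examples.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.1, 0.0), (3.0, 0.0), (3.2, 0.0)]
labels = ["maj", "maj", "maj", "min", "min", "min"]

new_points, new_labels = undersample_tomek(points, labels, "maj")
print(tomek_links(points, labels))
print(new_labels)
```

In this toy set the only cross-class mutual nearest neighbors are the points at x = 1.0 and x = 1.1, so the majority example at x = 1.0 is the one removed. The brute-force neighbor search is quadratic in the number of examples, which is acceptable for a sketch but would likely need a spatial index on a larger dataset.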
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
de Souto, M.C.P., Bittencourt, V.G., Costa, J.A.F. (2006). An Empirical Analysis of Under-Sampling Techniques to Balance a Protein Structural Class Dataset. In: King, I., Wang, J., Chan, LW., Wang, D. (eds) Neural Information Processing. ICONIP 2006. Lecture Notes in Computer Science, vol 4234. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11893295_3
DOI: https://doi.org/10.1007/11893295_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-46484-6
Online ISBN: 978-3-540-46485-3
eBook Packages: Computer Science, Computer Science (R0)