An Empirical Analysis of Under-Sampling Techniques to Balance a Protein Structural Class Dataset

Marcilio C. P. de Souto,
Valnaide G. Bittencourt &
Jose A. F. Costa

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4234)

Included in the following conference series:

International Conference on Neural Information Processing

Abstract

There has been a great deal of research on learning from imbalanced datasets. Among the methods proposed to address this problem, the most common are based on either under-sampling or over-sampling of the original dataset. In this work, we evaluate several under-sampling methods, such as Tomek Links, with the goal of improving the performance of the classifiers generated by different ML algorithms (decision trees and support vector machines, among others) applied to the problem of determining the structural similarity of proteins.
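
As a rough illustration of the kind of under-sampling the abstract refers to, the sketch below finds Tomek links (pairs of examples from different classes that are each other's nearest neighbour) and drops the majority-class member of each pair. It is not the authors' implementation: the function names `tomek_link_pairs` and `undersample_tomek`, the use of Euclidean distance, and the single `majority_label` argument are all assumptions made for this example.

```python
import numpy as np


def tomek_link_pairs(X, y):
    """Return index pairs (i, j) with i < j that form Tomek links.

    Assumption: Euclidean distance on the raw feature vectors.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances between all examples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    # An example must not be counted as its own nearest neighbour.
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)

    pairs = []
    for i, j in enumerate(nearest):
        # Mutual nearest neighbours with different class labels.
        if i < j and nearest[j] == i and y[i] != y[j]:
            pairs.append((i, int(j)))
    return pairs


def undersample_tomek(X, y, majority_label):
    """Drop the majority-class member of every Tomek link.

    `majority_label` is a hypothetical parameter naming the class to thin out.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    drop = set()
    for i, j in tomek_link_pairs(X, y):
        for k in (i, j):
            if y[k] == majority_label:
                drop.add(k)
    keep = np.array([k for k in range(len(y)) if k not in drop], dtype=int)
    return X[keep], y[keep]


# Toy data: class 0 is the majority; index 1 sits right next to the lone class-1 point.
X_toy = np.array([[0.0], [1.0], [1.2], [5.0], [5.1]])
y_toy = np.array([0, 0, 1, 0, 0])
X_bal, y_bal = undersample_tomek(X_toy, y_toy, majority_label=0)
```

On the toy data, only the borderline majority example next to the minority point is removed; the mutual neighbours at 5.0 and 5.1 share a class and are left alone. For a multi-class dataset such as the one studied here, the same routine could be applied once per over-represented class, though that choice is also an assumption of this sketch.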



Author information

Authors and Affiliations

Department of Informatics and Applied Mathematics, Federal University of Rio Grande do Norte, Campus Universitario, 59072-970, Natal-RN, Brazil
Marcilio C. P. de Souto
Department of Computing and Automation, Federal University of Rio Grande do Norte, Campus Universitario, 59072-970, Natal-RN, Brazil
Valnaide G. Bittencourt
Department of Electric Engineering, Federal University of Rio Grande do Norte, Campus Universitario, 59072-970, Natal-RN, Brazil
Jose A. F. Costa

Editor information

Editors and Affiliations

Dept. of Computer Science and Engineering, The Chinese Univ. of Hong Kong, Shatin, N.T., Hong Kong
Irwin King
Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong, New Territories, China
Jun Wang
The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
Lai-Wan Chan
Department of Computer Science and Engineering & Center for Cognitive Science, The Ohio State University, OH 43210, Columbus
DeLiang Wang

About this paper

Cite this paper

de Souto, M.C.P., Bittencourt, V.G., Costa, J.A.F. (2006). An Empirical Analysis of Under-Sampling Techniques to Balance a Protein Structural Class Dataset. In: King, I., Wang, J., Chan, LW., Wang, D. (eds) Neural Information Processing. ICONIP 2006. Lecture Notes in Computer Science, vol 4234. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11893295_3

DOI: https://doi.org/10.1007/11893295_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-46484-6
Online ISBN: 978-3-540-46485-3
eBook Packages: Computer Science, Computer Science (R0)
