Abstract
The paper discusses how instance selection can be used to assess kNN performance. There exists a strong correlation between the compression level of the dataset obtained by instance selection methods and the prediction accuracy of a kNN classifier trained on the full training dataset.
Based on two standard instance selection algorithms, CNN and ENN, which belong to two different groups of methods, the so-called condensation and editing methods, we perform an empirical analysis to verify this relation. The obtained results show that the relation is almost linear: the level of compression is linearly correlated with the accuracy. In other words, knowing the compression achieved by an instance selection method, we can estimate the accuracy of the final kNN prediction model.
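As a concrete illustration, consider the following minimal sketch (not the paper's experimental code): it implements Hart's CNN and Wilson's ENN on top of scikit-learn and prints, for a single dataset, the compression level of each method next to the accuracy of a kNN model trained on the full training set. The dataset (Iris), k = 1, and the 70/30 split are illustrative assumptions; the paper's analysis repeats such measurements over many datasets to expose the near-linear relation between the two quantities.

# Minimal sketch: compression level of CNN/ENN vs. accuracy of kNN
# trained on the full data. Dataset and parameters are illustrative.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def cnn_select(X, y):
    """Hart's CNN: grow a store of instances until 1-NN on the store
    classifies every training instance correctly."""
    keep = [0]  # seed the store with a single instance
    changed = True
    while changed:
        changed = False
        nn = KNeighborsClassifier(n_neighbors=1).fit(X[keep], y[keep])
        for i in range(len(X)):
            if i not in keep and nn.predict(X[i:i + 1])[0] != y[i]:
                keep.append(i)  # misclassified -> add to the store
                nn = KNeighborsClassifier(n_neighbors=1).fit(X[keep], y[keep])
                changed = True
    return np.array(keep)

def enn_select(X, y, k=3):
    """Wilson's ENN: drop every instance misclassified by its
    k nearest neighbours (leave-one-out vote; assumes the query
    point itself comes back as the first neighbour)."""
    nbrs = KNeighborsClassifier(n_neighbors=k + 1).fit(X, y)
    idx = nbrs.kneighbors(X, return_distance=False)[:, 1:]  # drop self
    votes = np.array([np.bincount(y[row]).argmax() for row in idx])
    return np.where(votes == y)[0]

X_tr, X_te, y_tr, y_te = train_test_split(
    *load_iris(return_X_y=True), test_size=0.3, random_state=0)

# Accuracy of kNN trained on the FULL training set -- the quantity
# the compression level is claimed to predict.
acc_full = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr).score(X_te, y_te)

for name, kept in (("CNN", cnn_select(X_tr, y_tr)),
                   ("ENN", enn_select(X_tr, y_tr))):
    compression = 1.0 - len(kept) / len(X_tr)  # fraction of instances removed
    print(f"{name}: compression = {compression:.2f}, "
          f"full-data 1-NN accuracy = {acc_full:.2f}")

Plotting compression against full-data accuracy across a benchmark suite, rather than a single dataset as here, is what reveals the near-linear dependence the abstract describes.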
References
Asuncion, A., Newman, D.: UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html
Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)
Duch, W., Wieczorek, T., Biesiada, J., Blachnik, M.: Comparison of feature ranking methods based on information entropy. In: Proceedings of International Joint Conference on Neural Networks, pp. 1415–1420. IEEE Press, Budapest (2004)
Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings of the 25th Very Large Database (VLDB) Conference (1999)
Hart, P.: The condensed nearest neighbor rule. IEEE Trans. Inf. Theory 14(3), 515–516 (1968)
Herrera, F.: KEEL, knowledge extraction based on evolutionary learning (2005). http://www.keel.es. Spanish National Projects TIC2002-04036-C05, TIN2005-08386-C05 and TIN2008-06681-C06
Kachel, A., Biesiada, J., Blachnik, M., Duch, W.: Infosel++: information based feature selection C++ library. In: Rutkowski, L., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2010, Part I. LNCS, vol. 6113, pp. 388–396. Springer, Heidelberg (2010)
Liu, H., Motoda, H.: Feature Selection for Knowledge Discovery and Data Mining. Springer, New York (1998)
Omohundro, S.M.: Five balltree construction algorithms. Technical report TR-89-063, International Computer Science Institute, December 1989
García, S., Derrac, J., Cano, J.R., Herrera, F.: Prototype selection for nearest neighbor classification: taxonomy and empirical study. IEEE Trans. Pattern Anal. Mach. Intell. 34(3), 417–435 (2012)
Uhlmann, J.K.: Satisfying general proximity/similarity queries with metric trees. Inf. Proc. Lett. 40(4), 175–179 (1991)
Wilson, D.: Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. SMC-2(3), 408–421 (1972)
Yao, Y.Y.: Information-theoretic measures for knowledge discovery and data mining. In: Karmeshu, P. (ed.) Entropy Measures, Maximum Entropy Principle and Emerging Applications. STUDFUZZ, vol. 119, pp. 115–136. Springer, Heidelberg (2003)
Perlich, C.: Learning curves in machine learning. In: Sammut, C., Webb, G.I. (eds.) Encyclopedia of Machine Learning, pp. 577–580. Springer, New York (2011)
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Blachnik, M. (2016). On the Relation Between kNN Accuracy and Dataset Compression Level. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2016. Lecture Notes in Computer Science, vol 9692. Springer, Cham. https://doi.org/10.1007/978-3-319-39378-0_46
DOI: https://doi.org/10.1007/978-3-319-39378-0_46
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-39377-3
Online ISBN: 978-3-319-39378-0