More Web Proxy on the site http://driver.im/

research-article

Efficient dataset size reduction by finding homogeneous clusters

Authors:

Stefanos Ougiaroglou,

Georgios EvangelidisAuthors Info & Claims

BCI '12: Proceedings of the Fifth Balkan Conference in Informatics

Pages 168 - 173

https://doi.org/10.1145/2371316.2371349

Published: 16 September 2012 Publication History

Abstract

Although the k-Nearest Neighbor classifier is one of the most widely-used classification methods, it suffers from the high computational cost and storage requirements it involves. These major drawbacks have constituted an active research field over the last decades. This paper proposes an effective data reduction algorithm that has low preprocessing cost and reduces storage requirements while maintaining classification accuracy at an acceptable high level. The proposed algorithm is based on a fast pre-processing clustering procedure that creates homogeneous clusters. The centroids of these clusters constitute the reduced training-set. Experimental results, based on real-life datasets, illustrate that the proposed algorithm is faster and achieves higher reduction rates than three known existing methods, while it does not significantly reduce the classification accuracy.

References

[1]

D. W. Aha, D. F. Kibler, and M. K. Albert. Instance-based learning algorithms. Machine Learning, 6:37--66, 1991.

[2]

J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, and S. García. Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Multiple-Valued Logic and Soft Computing, 17(2--3):255--287, 2011.

[3]

H. Brighton and C. Mellish. Advances in instance selection for instance-based learning algorithms. Data Min. Knowl. Discov., 6(2):153--172, Apr. 2002.

Digital Library

[4]

C.-L. Chang. Finding prototypes for nearest neighbor classifiers. IEEE Trans. Comput., 23(11):1179--1184, Nov. 1974.

Digital Library

[5]

C. H. Chen and A. Jóźwik. A sample set condensation algorithm for the class sensitive artificial neural network. Pattern Recogn. Lett., 17:819--823, July 1996.

Digital Library

[6]

C.-H. Chou, B.-H. Kuo, and F. Chang. The generalized condensed nearest neighbor rule as a data reduction method. In Proceedings of the 18th International Conference on Pattern Recognition - Volume 02, ICPR '06, pages 556--559, Washington, DC, USA, 2006. IEEE Computer Society.

Digital Library

[7]

B. V. Dasarathy. Nearest neighbor (NN) norms: NN pattern classification techniques. IEEE Computer Society Press, 1991.

[8]

P. Datta and D. F. Kibler. Learning symbolic prototypes. In Proceedings of the Fourteenth International Conference on Machine Learning, ICML '97, pages 75--82, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.

Digital Library

[9]

V. S. Devi and M. N. Murty. An incremental prototype set building technique. Pattern Recognition, 35(2):505--513, 2002.

[10]

H. A. Fayed, S. R. Hashem, and A. F. Atiya. Self-generating prototypes for pattern classification. Pattern Recogn., 40:1498--1509, May 2007.

Digital Library

[11]

S. Garcia, J. Derrac, J. Cano, and F. Herrera. Prototype selection for nearest neighbor classification: Taxonomy and empirical study. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):417--435, 2012.

Digital Library

[12]

G. W. Gates. The reduced nearest neighbor rule. ieee transactions on information theory. IEEE Transactions on Information Theory, 18(3):431--433, 1972.

Digital Library

[13]

M. Grochowski and N. Jankowski. Comparison of instance selection algorithms ii. results and comments. In Artificial Intelligence and Soft Computing - ICAISC 2004, volume 3070 of Lecture Notes in Computer Science, pages 580--585. Springer Berlin/Heidelberg, 2004.

[14]

P. E. Hart. The condensed nearest neighbor rule. IEEE Transactions on Information Theory, 14(3):515--516, 1968.

Digital Library

[15]

N. Jankowski and M. Grochowski. Comparison of instances seletion algorithms i. algorithms survey. In Artificial Intelligence and Soft Computing - ICAISC 2004, volume 3070 of Lecture Notes in Computer Science, pages 598--603. Springer Berlin/Heidelberg, 2004.

[16]

M. Lozano. Data Reduction Techniques in Classification processes (Phd Thesis). Universitat Jaume I, 2007.

[17]

J. McQueen. Some methods for classification and analysis of multivariate observations. In Proc. of 5th Berkeley Symp. on Math. Statistics and Probability, pages 281--298, Berkeley, CA: University of California Press, 1967.

[18]

R. Mollineda, F. Ferri, and E. Vidal. An efficient prototype merging strategy for the condensed 1-nn rule through class-conditional hierarchical clustering. Pattern Recognition, 35(12):2771--2782, 2002.

[19]

J. A. Olvera-López, J. A. Carrasco-Ochoa, J. F. Martínez-Trinidad, and J. Kittler. A review of instance selection methods. Artif. Intell. Rev., 34(2):133--143, Aug. 2010.

Digital Library

[20]

J. A. Olvera-Lopez, J. A. Carrasco-Ochoa, and J. F. M. Trinidad. A new fast prototype selection method based on clustering. Pattern Anal. Appl., 13(2):131--141, 2010.

Digital Library

[21]

G. Ritter, H. Woodruff, S. Lowry, and T. Isenhour. An algorithm for a selective nearest neighbor decision rule. IEEE Trans. on Inf. Theory, 21(6):665ï£¡669, 1975.

Digital Library

[22]

H. Samet. Foundations of multidimensional and metric data structures. The Morgan Kaufmann series in computer graphics. Elsevier/Morgan Kaufmann, 2006.

Digital Library

[23]

J. S. Sánchez. High training set size reduction by space partitioning and prototype abstraction. Pattern Recognition, 37(7):1561--1564, 2004.

[24]

G. Toussaint. Proximity graphs for nearest neighbor decision rules: Recent progress. In 34th Symposium on the INTERFACE, pages 17--20, 2002.

[25]

I. Triguero, J. Derrac, S. García, and F. Herrera. A taxonomy and experimental study on prototype generation for nearest neighbor classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 42(1):86--100, 2012.

Digital Library

[26]

D. L. Wilson. Asymptotic properties of nearest neighbor rules using edited data. IEEE trans. on systems, man, and cybernetics, 2(3):408--421, July 1972.

[27]

D. R. Wilson and T. R. Martinez. Reduction techniques for instance-based learning algorithms. Machine Learning, 38(3):257--286, 2000.

Digital Library

Cited By

Panja AKarim SNeogy SChowdhury C(2024)Improving the sustainability of WiFi-enabled indoor localization systems through meta-heuristic based instance selection approachExpert Systems with Applications10.1016/j.eswa.2024.125063257(125063)Online publication date: Jan-2024
https://doi.org/10.1016/j.eswa.2024.125063
Bluteau RGras R(2023)Lottery Ticket Search on Untrained Models with Applied Lottery Sample SelectionMachine Learning and Knowledge Extraction10.3390/make50200245:2(400-417)Online publication date: 18-Apr-2023
https://doi.org/10.3390/make5020024
Stage HEwecker LLangner JSohn TVillmann TSax E(2023)Reducing Computer Vision Dataset Size via Selective Sampling2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC)10.1109/ITSC57777.2023.10422460(1422-1428)Online publication date: 24-Sep-2023
https://doi.org/10.1109/ITSC57777.2023.10422460
Show More Cited By

Recommendations

Efficient editing and data abstraction by finding homogeneous clusters

The efficiency of the k-Nearest Neighbour classifier depends on the size of the training set as well as the level of noise in it. Large datasets with high level of noise lead to less accurate classifiers with high computational cost and storage ...
RHC: a non-parametric cluster-based data reduction for efficient $$k$$k-NN classification

Although the $$k$$k-NN classifier is a popular classification method, it suffers from the high computational cost and storage requirements it involves. This paper proposes two effective cluster-based data reduction algorithms for efficient $$k$$k-NN ...
Very fast variations of training set size reduction algorithms for instance-based classification
IDEAS '23: Proceedings of the 27th International Database Engineered Applications Symposium

Reduction through Homogeneous Clustering (RHC) and its editing variant (ERHC) are effective data reduction techniques for the k-NN classifier. They are based on an iterative k-means clustering task that discovers homogeneous clusters. The centers of ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences

BCI '12: Proceedings of the Fifth Balkan Conference in Informatics

September 2012

312 pages

ISBN:9781450312400

DOI:10.1145/2371316

General Chair:
Mirjana Ivanović
Univ. of Novi Sad, Serbia
,
Program Chair:
Zoran Budimac
Univ. of Novi Sad, Serbia

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

MSTD: Ministry of Education, Science and Technological Development - Serbia
Novi Sad: Faculty of Technical Sciences, University of Novi Sad

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 September 2012

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

BCI '12

Sponsor:

MSTD
Novi Sad

BCI '12: Balkan Conference in Informatics, 2012

September 16 - 20, 2012

Novi Sad, Serbia

Acceptance Rates

Overall Acceptance Rate 97 of 250 submissions, 39%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

18
Total Citations
View Citations
184
Total Downloads

Downloads (Last 12 months)6
Downloads (Last 6 weeks)0

Reflects downloads up to 03 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Panja AKarim SNeogy SChowdhury C(2024)Improving the sustainability of WiFi-enabled indoor localization systems through meta-heuristic based instance selection approachExpert Systems with Applications10.1016/j.eswa.2024.125063257(125063)Online publication date: Jan-2024
https://doi.org/10.1016/j.eswa.2024.125063
Bluteau RGras R(2023)Lottery Ticket Search on Untrained Models with Applied Lottery Sample SelectionMachine Learning and Knowledge Extraction10.3390/make50200245:2(400-417)Online publication date: 18-Apr-2023
https://doi.org/10.3390/make5020024
Stage HEwecker LLangner JSohn TVillmann TSax E(2023)Reducing Computer Vision Dataset Size via Selective Sampling2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC)10.1109/ITSC57777.2023.10422460(1422-1428)Online publication date: 24-Sep-2023
https://doi.org/10.1109/ITSC57777.2023.10422460
Panja ARayala AAgarwala ANeogy SChowdhury C(2023)A hybrid tuple selection pipeline for smartphone based Human Activity RecognitionExpert Systems with Applications10.1016/j.eswa.2023.119536217(119536)Online publication date: May-2023
https://doi.org/10.1016/j.eswa.2023.119536
Byerly AKalganova T(2023)Towards an Analytical Definition of Sufficient DataSN Computer Science10.1007/s42979-022-01549-44:2Online publication date: 7-Jan-2023
https://doi.org/10.1007/s42979-022-01549-4
Guibert AHurter CCouellan N(2023)LRP-GUS: A Visual Based Data Reduction Algorithm for Neural NetworksArtificial Neural Networks and Machine Learning – ICANN 202310.1007/978-3-031-44192-9_27(337-349)Online publication date: 22-Sep-2023
https://doi.org/10.1007/978-3-031-44192-9_27
Steadman LGriffiths NJarvis SBell MHelman SWallbank C(2021)kD-STR: A Method for Spatio-Temporal Data Reduction and ModellingACM/IMS Transactions on Data Science10.1145/34393342:3(1-31)Online publication date: 18-May-2021
https://dl.acm.org/doi/10.1145/3439334
Panhalkar ADoye D(2021)An approach of improving decision tree classifier using condensed informative dataDECISION10.1007/s40622-020-00265-3Online publication date: 28-Jan-2021
https://doi.org/10.1007/s40622-020-00265-3
Ponos POugiaroglou SEvangelidis G(2019)Improving Data Reduction by Merging PrototypesAdvances in Databases and Information Systems10.1007/978-3-030-28730-6_2(20-32)Online publication date: 13-Aug-2019
https://doi.org/10.1007/978-3-030-28730-6_2
Ougiaroglou SArampatzis GDervos DEvangelidis G(2017)Generating Fixed-Size Training Sets for Large and Streaming DatasetsAdvances in Databases and Information Systems10.1007/978-3-319-66917-5_7(88-102)Online publication date: 25-Aug-2017
https://doi.org/10.1007/978-3-319-66917-5_7
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten