[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2371316.2371349acmotherconferencesArticle/Chapter ViewAbstractPublication PagesbciConference Proceedingsconference-collections
research-article

Efficient dataset size reduction by finding homogeneous clusters

Published: 16 September 2012 Publication History

Abstract

Although the k-Nearest Neighbor classifier is one of the most widely-used classification methods, it suffers from the high computational cost and storage requirements it involves. These major drawbacks have constituted an active research field over the last decades. This paper proposes an effective data reduction algorithm that has low preprocessing cost and reduces storage requirements while maintaining classification accuracy at an acceptable high level. The proposed algorithm is based on a fast pre-processing clustering procedure that creates homogeneous clusters. The centroids of these clusters constitute the reduced training-set. Experimental results, based on real-life datasets, illustrate that the proposed algorithm is faster and achieves higher reduction rates than three known existing methods, while it does not significantly reduce the classification accuracy.

References

[1]
D. W. Aha, D. F. Kibler, and M. K. Albert. Instance-based learning algorithms. Machine Learning, 6:37--66, 1991.
[2]
J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, and S. García. Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Multiple-Valued Logic and Soft Computing, 17(2--3):255--287, 2011.
[3]
H. Brighton and C. Mellish. Advances in instance selection for instance-based learning algorithms. Data Min. Knowl. Discov., 6(2):153--172, Apr. 2002.
[4]
C.-L. Chang. Finding prototypes for nearest neighbor classifiers. IEEE Trans. Comput., 23(11):1179--1184, Nov. 1974.
[5]
C. H. Chen and A. Jóźwik. A sample set condensation algorithm for the class sensitive artificial neural network. Pattern Recogn. Lett., 17:819--823, July 1996.
[6]
C.-H. Chou, B.-H. Kuo, and F. Chang. The generalized condensed nearest neighbor rule as a data reduction method. In Proceedings of the 18th International Conference on Pattern Recognition - Volume 02, ICPR '06, pages 556--559, Washington, DC, USA, 2006. IEEE Computer Society.
[7]
B. V. Dasarathy. Nearest neighbor (NN) norms: NN pattern classification techniques. IEEE Computer Society Press, 1991.
[8]
P. Datta and D. F. Kibler. Learning symbolic prototypes. In Proceedings of the Fourteenth International Conference on Machine Learning, ICML '97, pages 75--82, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.
[9]
V. S. Devi and M. N. Murty. An incremental prototype set building technique. Pattern Recognition, 35(2):505--513, 2002.
[10]
H. A. Fayed, S. R. Hashem, and A. F. Atiya. Self-generating prototypes for pattern classification. Pattern Recogn., 40:1498--1509, May 2007.
[11]
S. Garcia, J. Derrac, J. Cano, and F. Herrera. Prototype selection for nearest neighbor classification: Taxonomy and empirical study. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):417--435, 2012.
[12]
G. W. Gates. The reduced nearest neighbor rule. ieee transactions on information theory. IEEE Transactions on Information Theory, 18(3):431--433, 1972.
[13]
M. Grochowski and N. Jankowski. Comparison of instance selection algorithms ii. results and comments. In Artificial Intelligence and Soft Computing - ICAISC 2004, volume 3070 of Lecture Notes in Computer Science, pages 580--585. Springer Berlin/Heidelberg, 2004.
[14]
P. E. Hart. The condensed nearest neighbor rule. IEEE Transactions on Information Theory, 14(3):515--516, 1968.
[15]
N. Jankowski and M. Grochowski. Comparison of instances seletion algorithms i. algorithms survey. In Artificial Intelligence and Soft Computing - ICAISC 2004, volume 3070 of Lecture Notes in Computer Science, pages 598--603. Springer Berlin/Heidelberg, 2004.
[16]
M. Lozano. Data Reduction Techniques in Classification processes (Phd Thesis). Universitat Jaume I, 2007.
[17]
J. McQueen. Some methods for classification and analysis of multivariate observations. In Proc. of 5th Berkeley Symp. on Math. Statistics and Probability, pages 281--298, Berkeley, CA: University of California Press, 1967.
[18]
R. Mollineda, F. Ferri, and E. Vidal. An efficient prototype merging strategy for the condensed 1-nn rule through class-conditional hierarchical clustering. Pattern Recognition, 35(12):2771--2782, 2002.
[19]
J. A. Olvera-López, J. A. Carrasco-Ochoa, J. F. Martínez-Trinidad, and J. Kittler. A review of instance selection methods. Artif. Intell. Rev., 34(2):133--143, Aug. 2010.
[20]
J. A. Olvera-Lopez, J. A. Carrasco-Ochoa, and J. F. M. Trinidad. A new fast prototype selection method based on clustering. Pattern Anal. Appl., 13(2):131--141, 2010.
[21]
G. Ritter, H. Woodruff, S. Lowry, and T. Isenhour. An algorithm for a selective nearest neighbor decision rule. IEEE Trans. on Inf. Theory, 21(6):665669, 1975.
[22]
H. Samet. Foundations of multidimensional and metric data structures. The Morgan Kaufmann series in computer graphics. Elsevier/Morgan Kaufmann, 2006.
[23]
J. S. Sánchez. High training set size reduction by space partitioning and prototype abstraction. Pattern Recognition, 37(7):1561--1564, 2004.
[24]
G. Toussaint. Proximity graphs for nearest neighbor decision rules: Recent progress. In 34th Symposium on the INTERFACE, pages 17--20, 2002.
[25]
I. Triguero, J. Derrac, S. García, and F. Herrera. A taxonomy and experimental study on prototype generation for nearest neighbor classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 42(1):86--100, 2012.
[26]
D. L. Wilson. Asymptotic properties of nearest neighbor rules using edited data. IEEE trans. on systems, man, and cybernetics, 2(3):408--421, July 1972.
[27]
D. R. Wilson and T. R. Martinez. Reduction techniques for instance-based learning algorithms. Machine Learning, 38(3):257--286, 2000.

Cited By

View all
  • (2024)Improving the sustainability of WiFi-enabled indoor localization systems through meta-heuristic based instance selection approachExpert Systems with Applications10.1016/j.eswa.2024.125063257(125063)Online publication date: Jan-2024
  • (2023)Lottery Ticket Search on Untrained Models with Applied Lottery Sample SelectionMachine Learning and Knowledge Extraction10.3390/make50200245:2(400-417)Online publication date: 18-Apr-2023
  • (2023)Reducing Computer Vision Dataset Size via Selective Sampling2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC)10.1109/ITSC57777.2023.10422460(1422-1428)Online publication date: 24-Sep-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences
BCI '12: Proceedings of the Fifth Balkan Conference in Informatics
September 2012
312 pages
ISBN:9781450312400
DOI:10.1145/2371316
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

  • MSTD: Ministry of Education, Science and Technological Development - Serbia
  • Novi Sad: Faculty of Technical Sciences, University of Novi Sad

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 September 2012

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. k-NN classification
  2. clustering
  3. data reduction

Qualifiers

  • Research-article

Conference

BCI '12
Sponsor:
  • MSTD
  • Novi Sad
BCI '12: Balkan Conference in Informatics, 2012
September 16 - 20, 2012
Novi Sad, Serbia

Acceptance Rates

Overall Acceptance Rate 97 of 250 submissions, 39%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)9
  • Downloads (Last 6 weeks)0
Reflects downloads up to 16 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Improving the sustainability of WiFi-enabled indoor localization systems through meta-heuristic based instance selection approachExpert Systems with Applications10.1016/j.eswa.2024.125063257(125063)Online publication date: Jan-2024
  • (2023)Lottery Ticket Search on Untrained Models with Applied Lottery Sample SelectionMachine Learning and Knowledge Extraction10.3390/make50200245:2(400-417)Online publication date: 18-Apr-2023
  • (2023)Reducing Computer Vision Dataset Size via Selective Sampling2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC)10.1109/ITSC57777.2023.10422460(1422-1428)Online publication date: 24-Sep-2023
  • (2023)A hybrid tuple selection pipeline for smartphone based Human Activity RecognitionExpert Systems with Applications10.1016/j.eswa.2023.119536217(119536)Online publication date: May-2023
  • (2023)Towards an Analytical Definition of Sufficient DataSN Computer Science10.1007/s42979-022-01549-44:2Online publication date: 7-Jan-2023
  • (2023)LRP-GUS: A Visual Based Data Reduction Algorithm for Neural NetworksArtificial Neural Networks and Machine Learning – ICANN 202310.1007/978-3-031-44192-9_27(337-349)Online publication date: 22-Sep-2023
  • (2021)kD-STR: A Method for Spatio-Temporal Data Reduction and ModellingACM/IMS Transactions on Data Science10.1145/34393342:3(1-31)Online publication date: 18-May-2021
  • (2021)An approach of improving decision tree classifier using condensed informative dataDECISION10.1007/s40622-020-00265-3Online publication date: 28-Jan-2021
  • (2019)Improving Data Reduction by Merging PrototypesAdvances in Databases and Information Systems10.1007/978-3-030-28730-6_2(20-32)Online publication date: 13-Aug-2019
  • (2017)Generating Fixed-Size Training Sets for Large and Streaming DatasetsAdvances in Databases and Information Systems10.1007/978-3-319-66917-5_7(88-102)Online publication date: 25-Aug-2017
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media