short-paper

The Effect of Parallelism on Data Reduction

Authors:

Pavlos Ponos,

Stefanos Ougiaroglou,

Georgios Evangelidis

BCI'19: Proceedings of the 9th Balkan Conference on Informatics

Article No.: 33, Pages 1 - 4

https://doi.org/10.1145/3351556.3351584

Published: 26 September 2019

Abstract

In this paper, we investigate the effect of parallelism on two data reduction algorithms that use k-Means clustering to find homogeneous clusters in the training set. A cluster is homogeneous when all of its instances belong to the same class. Our approach divides the training set into subsets and applies the data reduction algorithm to each subset in parallel. The reduced subsets are then merged into the final reduced set. In our experimental study, we split the datasets into 8, 16, 32 and 64 subsets. The results show that parallelism can achieve very low preprocessing costs. Moreover, when the number of subsets is high, for some datasets the accuracy of k-NN classification is almost equal to (if not better than) that achieved with the standard execution of the reduction algorithms, at the cost of a small loss in reduction rate.
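The split, reduce and merge scheme described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `reduce_subset` and `parallel_reduce` are hypothetical. The round-robin split is an assumption, since the abstract does not say how subsets are formed. The reducer is a simplified stand-in that splits a non-homogeneous cluster with a single nearest-class-mean assignment step instead of full k-Means. A thread pool keeps the sketch short; real CPU parallelism in CPython would likely need a process pool or a cluster framework.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance between two tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def reduce_subset(data):
    """Toy reducer: replace every homogeneous cluster by its centroid.

    `data` is a list of (features, label) pairs. A non-homogeneous cluster
    is split by assigning each instance to its nearest class mean (one
    assignment step, a simplification of full k-Means), and the resulting
    parts are processed in turn.
    """
    prototypes, stack = [], [data]
    while stack:
        cluster = stack.pop()
        by_class = defaultdict(list)
        for x, y in cluster:
            by_class[y].append(x)
        if len(by_class) == 1:  # homogeneous: keep only the centroid
            label = next(iter(by_class))
            prototypes.append((mean(by_class[label]), label))
            continue
        seeds = [mean(xs) for xs in by_class.values()]
        parts = defaultdict(list)
        for x, y in cluster:
            nearest = min(range(len(seeds)), key=lambda i: sq_dist(x, seeds[i]))
            parts[nearest].append((x, y))
        if len(parts) == 1:  # no progress: fall back to one prototype per class
            prototypes.extend((mean(xs), y) for y, xs in by_class.items())
        else:
            stack.extend(parts.values())
    return prototypes


def parallel_reduce(data, n_subsets):
    """Split `data` round-robin, reduce each subset concurrently, then merge."""
    subsets = [data[i::n_subsets] for i in range(n_subsets)]
    subsets = [s for s in subsets if s]  # drop empty subsets
    with ThreadPoolExecutor() as pool:
        reduced = list(pool.map(reduce_subset, subsets))
    return [p for part in reduced for p in part]
```

Because each subset is reduced independently, the merged set may contain more prototypes than a single run over the whole training set, which is consistent with the small loss in reduction rate reported in the abstract.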

Published In

BCI'19: Proceedings of the 9th Balkan Conference on Informatics

September 2019

225 pages

ISBN: 9781450371933

DOI: 10.1145/3351556

In-Cooperation

Technical University of Sofia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper
Research
Refereed limited

Conference

BCI'19: 9th Balkan Conference in Informatics

September 26 - 28, 2019

Sofia, Bulgaria

Acceptance Rates

BCI'19 Paper Acceptance Rate: 24 of 73 submissions, 33%

Overall Acceptance Rate: 97 of 250 submissions, 39%
