DOI: 10.1145/3351556.3351584
short-paper

The Effect of Parallelism on Data Reduction

Published: 26 September 2019

Abstract

In this paper, we investigate the effect of parallelism on two data reduction algorithms that use k-Means clustering to find homogeneous clusters in the training set. By homogeneous, we mean clusters in which all instances belong to the same class. Our approach divides the training set into subsets and applies the data reduction algorithm to each subset in parallel. The reduced subsets are then merged into the final reduced set. In our experimental study, we split the datasets into 8, 16, 32 and 64 subsets. The results show that parallelism can achieve very low preprocessing costs. Moreover, when the number of subsets is high, for some datasets the accuracy of k-NN classification is almost equal to (if not better than) that achieved by the standard execution of the reduction algorithms, with a small loss in reduction rate.
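
The split, reduce-in-parallel, and merge steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`split`, `reduce_subset`, `parallel_reduce`) are hypothetical, the round-robin split is an assumption (the abstract does not say how subsets are formed), and `reduce_subset` is a deliberately simple placeholder that keeps one mean prototype per class. It stands in for the two k-Means-based reduction algorithms, which are not reproduced here.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split(training_set, n_subsets):
    """Deal instances round-robin into n_subsets disjoint subsets (assumed scheme)."""
    return [training_set[i::n_subsets] for i in range(n_subsets)]


def reduce_subset(subset):
    """Placeholder reduction: one mean prototype per class in the subset.

    NOT one of the paper's algorithms; it only marks where a real
    homogeneous-cluster reduction would be plugged in.
    """
    by_class = defaultdict(list)
    for features, label in subset:
        by_class[label].append(features)
    prototypes = []
    for label, rows in sorted(by_class.items()):
        # Column-wise mean of all feature vectors sharing this label.
        mean = tuple(sum(col) / len(rows) for col in zip(*rows))
        prototypes.append((mean, label))
    return prototypes


def parallel_reduce(training_set, n_subsets, reducer=reduce_subset):
    """Split the training set, reduce each subset concurrently, merge the results."""
    subsets = [s for s in split(training_set, n_subsets) if s]
    if not subsets:
        return []
    with ThreadPoolExecutor(max_workers=len(subsets)) as pool:
        reduced_parts = list(pool.map(reducer, subsets))
    # Merge step: concatenate the reduced subsets into the final reduced set.
    return [prototype for part in reduced_parts for prototype in part]


# Toy data: eight 2-D instances, the first four in class "a", the rest in "b".
data = [((float(i), float(i)), "a" if i < 4 else "b") for i in range(8)]
reduced = parallel_reduce(data, n_subsets=2)
reduction_rate = 1 - len(reduced) / len(data)
```

On the toy data, each of the two subsets contributes one prototype per class, so the merged set has four prototypes and the reduction rate is 0.5. Threads are used only to keep the sketch short; in standard CPython builds a CPU-bound reducer would need `ProcessPoolExecutor` (or a cluster framework) to gain real parallel speed-up.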



Published In

BCI'19: Proceedings of the 9th Balkan Conference on Informatics
September 2019
225 pages

In-Cooperation

  • Technical University of Sofia

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Clustering
  2. Data Reduction
  3. Parallel Implementation
  4. Prototype Merging
  5. k-NN Classification

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

BCI'19
BCI'19: 9th Balkan Conference in Informatics
September 26 - 28, 2019
Sofia, Bulgaria

Acceptance Rates

BCI'19 Paper Acceptance Rate 24 of 73 submissions, 33%;
Overall Acceptance Rate 97 of 250 submissions, 39%


Article Metrics

  • Total Citations: 0
  • Total Downloads: 36
  • Downloads (Last 12 months): 4
  • Downloads (Last 6 weeks): 0

Reflects downloads up to 16 Dec 2024
