Performance improvement of data mining in Weka through multi-core and GPU acceleration: opportunities and pitfalls

Tiago Augusto Engel¹,
Andrea Schwertner Charão¹,
Manuele Kirsch-Pinheiro² &
Luiz-Angelo Steffenel³

Abstract

Data mining tools can be computationally demanding, which has led to growing interest in parallel computing strategies to improve their performance. Although multi-core processors and Graphics Processing Unit (GPU) accelerators have increased the computing power of current desktop computers, desktop-based data mining tools do not yet take full advantage of these architectures. This paper investigates strategies to improve the performance of Weka, a popular data mining tool, through multi-core and GPU acceleration. Using performance profiling of Weka, we identify operations that could improve data mining performance when parallelized. We select two of these operations and analyze the impact of their parallel execution on Weka’s performance. These experiments demonstrate that, while significant speedups can be achieved, not all operations are amenable to parallelization, which reinforces the need for a careful and well-studied selection of candidates.

Notes

1. See KDNuggets polls at http://www.kdnuggets.com/polls/2013/analytics-big-data-mining-data-science-software.html.
2. http://romeo.univ-reims.fr.
3. http://romeo.univ-reims.fr.
4. http://www.top500.org.
5. http://www.green500.org.
6. http://cosy.univ-reims.fr/PER-MARE.

Acknowledgments

The authors would like to thank the LSC Laboratory and the ROMEO Computing Center for the access to their resources. This project is partially financed by the STIC-AmSud PER-MARE project (see Note 6; project number 13STIC07), an international collaboration program supported by CAPES/MAEE/ANII agencies.

Author information

Authors and Affiliations

Laboratório de Sistemas de Computação, Universidade Federal de Santa Maria, Santa Maria, Brazil
Tiago Augusto Engel & Andrea Schwertner Charão
Centre de Recherche en Informatique, Université Paris 1 Panthéon-Sorbonne, Paris, France
Manuele Kirsch-Pinheiro
Laboratoire CReSTIC, Équipe SysCom, Université de Reims Champagne-Ardenne, Reims, France
Luiz-Angelo Steffenel

Corresponding author

Correspondence to Tiago Augusto Engel.

About this article

Cite this article

Engel, T.A., Charão, A.S., Kirsch-Pinheiro, M. et al. Performance improvement of data mining in Weka through multi-core and GPU acceleration: opportunities and pitfalls. J Ambient Intell Human Comput 6, 377–390 (2015). https://doi.org/10.1007/s12652-015-0292-9

Received: 15 December 2014
Accepted: 20 February 2015
Published: 24 June 2015
Issue Date: August 2015
DOI: https://doi.org/10.1007/s12652-015-0292-9
