More Web Proxy on the site http://driver.im/

research-article

Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance

Authors:

Gagan AgrawalAuthors Info & Claims

IEEE Transactions on Knowledge and Data Engineering, Volume 17, Issue 1

Pages 71 - 89

https://doi.org/10.1109/TKDE.2005.18

Published: 01 January 2005 Publication History

Abstract

With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this paper, we focus on shared memory parallelization of data mining algorithms. We have developed a series of techniques for parallelization of data mining algorithms, including full replication, full locking, fixed locking, optimized full locking, and cache-sensitive locking. Unlike previous work on shared memory parallelization of specific data mining algorithms, all of our techniques apply to a large number of popular data mining algorithms. In addition, we propose a reduction-object-based interface for specifying a data mining algorithm. We show how our runtime system can apply any of the techniques we have developed starting from a common specification of the algorithm. We have carried out a detailed evaluation of the parallelization techniques and the programming interface. We have experimented with apriori and fp-tree-based association mining, k-means clustering, k-nearest neighbor classifier, and decision tree construction. The main results from our experiments are as follows: 1) Among full replication, optimized full locking, and cache-sensitive locking, there is no clear winner. Each of these three techniques can outperform others depending upon machine and dataset parameters. These three techniques perform significantly better than the other two techniques. 2) Good parallel efficiency is achieved for each of the four algorithms we experimented with, using our techniques and runtime system. 3) The overhead of the interface is within 10 percent in almost all cases. 4) In the case of decision tree construction, combining different techniques turned out to be crucial for achieving high performance.

References

[1]

R. Agrawal T. Imielinski and A. Swami, “Database Mining: A Performance Perspective,” IEEE Trans. Knowledge and Data Eng., vol. 5, no. 6, pp. 914-925, Dec. 1993.]]

Digital Library

[2]

R. Agrawal and J. Shafer, “Parallel Mining of Association Rules,” IEEE Trans. Knowledge and Data Eng., vol. 8, no. 6, pp. 962-969, June 1996.]]

Digital Library

[3]

R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules,” Proc. 1994 Int'l Conf. Very Large Databases (VLDB '94), pp. 487-499, Sept. 1994.]]

Digital Library

[4]

K. Alsabti S. Ranka and V. Singh, “Clouds: Classification for Large or Out-of-Core Datasets,” http://www.cise.ufl.edu/ranka/dm.html, 1998.]]

[5]

P. Becuzzi M. Coppola and M. Vanneschi, “Mining of Association Rules in Very Large Databases: A Structured Parallel Approach,” Proc. Europar-99, vol. 1685, pp. 1441-1450, Aug. 1999.]]

Digital Library

[6]

W. Blume R. Doallo R. Eigenman J. Grout J. Hoelflinger T. Lawrence J. Lee D. Padua Y. Paek B. Pottenger L. Rauchwerger and P. Tu, “Parallel Programming with Polaris,” Computer, vol. 29, no. 12, pp. 78-82, Dec. 1996.]]

Digital Library

[7]

R.D. Blumofe C.F. Joerg, et al., “Cilk: An Efficient Multithreaded Runtime System,” Proc. Fifth ACM Conf. Principles and Practices of Parallel Programming (PPoPP), 1995.]]

Digital Library

[8]

S. Brin R. Motwani J. Ullman and S. Tsur, “Dynamic Itemset Counting and Implication Rules for Market Basket Data,” Proc. ACM SIGMOD Conf. Management of Data, May 1997.]]

Digital Library

[9]

P. Cheeseman and J. Stutz, “Bayesian Classification (Autoclass): Theory and Practice,” Advanced in Knowledge Discovery and Data Mining, pp. 61-83, 1996.]]

Digital Library

[10]

I.S. Dhillon and D.S. Modha, “A Data-Clustering Algorithm on Distributed Memory Multiprocessors,” Proc. Workshop Large-Scale Parallel KDD Systems, in conjunction with the Fifth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '99), pp. 47-56, Aug. 1999.]]

Digital Library

[11]

R. Ferreira G. Agrawal and J. Saltz, “Compiling Object-Oriented Data Intensive Computations,” Proc. 2000 Int'l Conf. Supercomputing, May 2000.]]

Digital Library

[12]

G. Forman and B. Zhang, “Distributed Data Clustering Can be Efficient and Exact,” Proc. SIGKDD Explorations, vol. 2, no. 2, Dec. 2000.]]

Digital Library

[13]

J. Gehrke V. Ganti R. Ramakrishnan and W. Loh, “Boat-Optimistic Decision Tree Construction,” Proc. ACM SIGMOD Conf. Management of Data, June 1999.]]

Digital Library

[14]

J. Gehrke R. Ramakrishnan and V. Ganti, “Rainforest-A Framework for Fast Decision Tree Construction of Large Datasets,” Proc. Conf. Very Large Databases (VLDB), 1998.]]

Digital Library

[15]

S. Goil and A. Choudhary, “Efficient Parallel Classification Using Dimensional Aggregates,” Proc. Workshop Large-Scala Parallel KDD Systems, with ACM SIGKDD-99, Aug. 1999.]]

Digital Library

[16]

S. Goil and A. Choudhary, “PARSIMONY: An Infrastructure for Parallel Multidimensional Analysis and Data Mining,” J. Parallel and Distributed Computing, vol. 61, no. 3, pp. 285-321, Mar. 2001.]]

Digital Library

[17]

E. Gutierrez O. Plata and E.L. Zapata, “A Compiler Method for the Parallel Execution of Irregular Reductions in Scalable Shared Memory Multiprocessors,” Proc. Int'l Conf. Supercomputing (ICS '00), pp.nbsp78-87, May 2000.]]

Digital Library

[18]

M. Hall S. Amarsinghe B. Murphy S. Liao and M. Lam, “Maximizing Multiprocessor Performance with the SUIF Compiler,” Computer, no. 12, Dec. 1996.]]

Digital Library

[19]

E.-H. Han G. Karypis and V. Kumar, “Scalable Parallel Datamining for Association Rules,” Proc. ACM SIGMOD 1997, May 1997.]]

Digital Library

[20]

E-H. Han G. Karypis and V. Kumar, “Scalable Parallel Datamining for Association Rules,” IEEE Trans. Data and Knowledge Eng., vol. 12, no. 3, May/June 2000.]]

Digital Library

[21]

J. Han J. Pei and Y. Yin, “Mining Frequent Patterns without Candidate Generation,” Proc. ACM SIGMOD Conf. Management of Data, 2000.]]

Digital Library

[22]

J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2000.]]

Digital Library

[23]

J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, 2nd ed. Morgan Kaufmann, Inc., 1996.]]

Digital Library

[24]

IBM. Db2, “Universal Database Goes Parallel with Enterprise and Enterprise-Extended Editions,” http://www-4.ibm.com/software/data/db2/udb/98eeebrochure, 1999.]]

[25]

A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Prentice Hall, 1988.]]

Digital Library

[26]

R. Jin and G. Agrawal, “An Efficient Implementation of Apriori Association Mining on Cluster of SMPs,” Proc. Workshop High Performance Data Mining (IPDPS 2001), Apr. 2001.]]

Digital Library

[27]

R. Jin and G. Agrawal, “A Middleware for Developing Parallel Data Mining Implementations,” Proc. First SIAM Conf. Data Mining, Apr. 2001.]]

[28]

R. Jin and G. Agrawal, “Performance Prediction for Random Write Reductions: A Case Study in Modeling Shared Memory Programs,” Proc. ACM SIGMETRICS, June 2002.]]

Digital Library

[29]

M.V. Joshi G. Karypis and V. Kumar, “Scalparc: A New Scalable and Efficient Parallel Classification Algorithm for Mining Large Datasets,” Proc. Int'l Parallel Processing Symp., 1998.]]

Digital Library

[30]

A. Kagi, “Mechanisms for Efficient Shared-Memory, Lock-Based Sychronization,” PhD thesis, Univ. of Wisconsin, Madison, 1999.]]

Digital Library

[31]

A. Kagi D. Burger and J.R. Goodman, “Efficient Synchronization: Let Them Eat QOLB,” Proc. 24th Ann. Int'l Symp. Computer Architecture, pp. 170-180, June 1997.]]

Digital Library

[32]

X. Li R. Jin and G. Agrawal, “A Compilation Framework for Distributed Memory Parallelizattion of Data Mining Algorithms,” submitted for publication, 2002.]]

[33]

X. Li R. Jin and G. Agrawal, “Compiler and Runtime Support for Shared Memory Parallelization of Data Mining Algorithms,” Proc. Conf. Language and Compilers for Parallel Computing, Aug. 2002.]]

Digital Library

[34]

Y. Lin and D. Padua, “On the Automatic Parallelization of Sparse and Irregular Fortran Programs,” Proc. Workshop Languages, Compilers, and Runtime Systems for Scalable Computers (LCR-98), May 1998.]]

Digital Library

[35]

H. Lu A.L. Cox S. Dwarkadas R. Rajamony and W. Zwaenepoel, “Compiler and Software Distributed Shared Memory Support for Irregular Applications,” Proc. Sixth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPOPP), pp. 48-56, June 1997.]]

Digital Library

[36]

T. Mahapatra and S. Mishra, Oracle Parallel Processing. O'Reilly Publishers, 2000.]]

Digital Library

[37]

M. Mehta R. Agrawal and J. Rissanen, “SLIQ: A Fast Scalable Classifier for Data Mining,” Proc. Fifth Int'l Conf. Extending Database Technology, 1996.]]

Digital Library

[38]

A. Mueller, “Fast Sequential and Parallel Algorithms for Association Rule Mining: A Comparison,” Technical Report CS-TR-3515, Univ. of Maryland, College Park, Aug. 1995.]]

Digital Library

[39]

S.K. Murthy, “Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey,” Data Mining and Knowledge Discovery, vol. 2, no. 4, pp. 345-389, 1998.]]

Digital Library

[40]

G.J. Narlikar, “A Parallel, Multithreaded Decision Tree Builder,” Technical Report CMU-CS-98-184, School of Computer Science, Carnegie Mellon Univ., 1998.]]

[41]

C.R. Palmer and C. Faloutsos, “Density Biases Sampling: An Improved Method for Data Mining and Clustering,” Proc. 2000 ACM SIGMOD Int'l Conf. Management of Data, June 2000.]]

Digital Library

[42]

J.S. Park M. Chen and P.S. Yu, “An Effecitive Hash Based Algorithm for Mining Association Rules,” Proc. ACM SIGMOD Int'l Conf. Management of Data, May 1995.]]

Digital Library

[43]

S. Parthasarathy M. Zaki and W. Li, “Memory Placement Techniques for Parallel Association Mining,” Proc. Fourth Int'l Conf. Knowledge Discovery and Data Mining (KDD), Aug. 1998.]]

Digital Library

[44]

S. Parthasarathy M. Zaki M. Ogihara and W. Li, “Parallel Data Mining for Association Rules on Shared-Memory Systems,” Knowledge and Information Systems, to appear, 2000.]]

Digital Library

[45]

F. Provost and V. Kolluri, “A Survey of Methods for Scaling up Inductive Algorithms,” Knowledge Discovery and Data Mining, vol. 3, 1999.]]

Digital Library

[46]

J.R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, Calif.: Morgan Kaufmann, 1993.]]

Digital Library

[47]

S. Ruggieri, “Efficient C4.5,” Technical Report TR-00-01, Dept. of Information, Univ. of Pisa, Feb. 1999.]]

Digital Library

[48]

J.H. Saltz R. Mirchandaney and K. Crowley, “Run-Time Parallelization and Scheduling of Loops,” IEEE Trans. Computers, vol. 40, no. 5, pp. 603-612, May 1991.]]

Digital Library

[49]

A. Savasere E. Omiecinski and S. Navathe, “An Efficient Algorithm for Mining Association Rules in Large Databases,” Proc. 21st Conf. Very Large Databases (VLDB), 1995.]]

Digital Library

[50]

J. Shafer R. Agrawal and M. Mehta, “SPRINT: A Scalable Parallel Classifier for Data Mining,” Proc. 22nd Int'l Conf. Very Large Databases (VLDB), pp. 544-555, Sept. 1996.]]

Digital Library

[51]

A. Shatdal, “Architectural Considerations for Parallel Query Evaluation Algorithms,” Technical Report CS-TR-1996-1321, Univ. of Wisconsin, 1999.]]

[52]

D.B. Skillicorn, “Strategies for Parallel Data Mining,” IEEE Concurrency, Oct./Dec. 1999.]]

Digital Library

[53]

A. Srivastava E. Han V. Kumar and V. Singh, “Parallel Formulations of Decision-Tree Classification Algorithms,” Proc. 1998 Int'l Conf. Parallel Processing, 1998.]]

Digital Library

[54]

H. Yu and L. Rauchwerger, “Adaptive Reduction Parallelization Techniques,” Proc. 2000 Int'l Conf. Supercomputing, pp. 66-75, May 2000.]]

Digital Library

[55]

M.J. Zaki C.-T. Ho and R. Agrawal, “Parallel Classification for Data Mining on Shared-Memory Multiprocessors,” Proc. IEEE Int'l Conf. Data Eng., pp. 198-205, May 1999.]]

Digital Library

[56]

M.J. Zaki M. Ogihara S. Parthasarathy and W. Li, “Parallel Data Mining for Association Rules on Shared Memory Multiprocessors,” Proc. Conf. Supercomputing '96, Nov. 1996.]]

Digital Library

[57]

M.J. Zaki, “Parallel and Distributed Association Mining: A Survey,” IEEE Concurrency, vol. 7, no. 4, pp. 14-25, 1999.]]

Digital Library

Cited By

Djenouri YLin JNørvåg KRamampiaro HYu P(2021)Exploring Decomposition for Solving Pattern Mining ProblemsACM Transactions on Management Information Systems10.1145/343977112:2(1-36)Online publication date: 11-Feb-2021
https://dl.acm.org/doi/10.1145/3439771
Ruiz MGómez-Romero JMolina-Solana MRos MMartin-Bautista M(2017)Information fusion from multiple databases using meta-association rulesInternational Journal of Approximate Reasoning10.1016/j.ijar.2016.09.00680:C(185-198)Online publication date: 1-Jan-2017
https://dl.acm.org/doi/10.1016/j.ijar.2016.09.006
Saeed ARauf AKhusro SMahfooz S(2016)Compressed Bitmaps Based Frequent Itemsets Mining on HadoopProceedings of the 10th International Conference on Informatics and Systems10.1145/2908446.2908457(159-165)Online publication date: 9-May-2016
https://dl.acm.org/doi/10.1145/2908446.2908457
Show More Cited By

Index Terms

Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance

Recommendations

A novel Boolean algebraic framework for association and pattern mining

Data mining has been defined as the non- trivial extraction of implicit, previously unknown and potentially useful information from data. Association mining and sequential mining analysis are considered as crucial components of strategic control over a ...
Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach

Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test ...
Re-mining item associations: Methodology and a case study in apparel retailing

Association mining is the conventional data mining technique for analyzing market basket data and it reveals the positive and negative associations between items. While being an integral part of transaction data, pricing and time information have not ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image IEEE Transactions on Knowledge and Data Engineering

IEEE Transactions on Knowledge and Data Engineering Volume 17, Issue 1

January 2005

143 pages

ISSN:1041-4347

Issue’s Table of Contents

Copyright © 2005.

Publisher

IEEE Educational Activities Department

United States

Publication History

Published: 01 January 2005

Author Tags

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

27
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 18 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Djenouri YLin JNørvåg KRamampiaro HYu P(2021)Exploring Decomposition for Solving Pattern Mining ProblemsACM Transactions on Management Information Systems10.1145/343977112:2(1-36)Online publication date: 11-Feb-2021
https://dl.acm.org/doi/10.1145/3439771
Ruiz MGómez-Romero JMolina-Solana MRos MMartin-Bautista M(2017)Information fusion from multiple databases using meta-association rulesInternational Journal of Approximate Reasoning10.1016/j.ijar.2016.09.00680:C(185-198)Online publication date: 1-Jan-2017
https://dl.acm.org/doi/10.1016/j.ijar.2016.09.006
Saeed ARauf AKhusro SMahfooz S(2016)Compressed Bitmaps Based Frequent Itemsets Mining on HadoopProceedings of the 10th International Conference on Informatics and Systems10.1145/2908446.2908457(159-165)Online publication date: 9-May-2016
https://dl.acm.org/doi/10.1145/2908446.2908457
Ruiz MGmez-Romero JMolina-Solana MCampaa JMartin-Bautista M(2016)Meta-association rules for mining interesting associations in multiple datasetsApplied Soft Computing10.1016/j.asoc.2016.08.01449:C(212-223)Online publication date: 1-Dec-2016
https://dl.acm.org/doi/10.1016/j.asoc.2016.08.014
Nasridinov AIhm SPark Y(2015)A hybrid construction of a decision tree for multimedia contentsMultimedia Tools and Applications10.1007/s11042-013-1614-674:19(8455-8465)Online publication date: 1-Oct-2015
https://dl.acm.org/doi/10.1007/s11042-013-1614-6
Han LOng H(2015)Parallel data intensive applications using MapReduceCluster Computing10.1007/s10586-014-0405-918:1(403-418)Online publication date: 1-Mar-2015
https://dl.acm.org/doi/10.1007/s10586-014-0405-9
Jaber KAbdullah RRashid N(2014)Fast decision tree-based method to index large DNA-protein sequence databases using hybrid distributed-shared memory programming modelInternational Journal of Bioinformatics Research and Applications10.1504/IJBRA.2014.06076510:3(321-340)Online publication date: 1-Apr-2014
https://dl.acm.org/doi/10.1504/IJBRA.2014.060765
Zhang CRé C(2014)DimmWittedProceedings of the VLDB Endowment10.14778/2732977.27330017:12(1283-1294)Online publication date: 1-Aug-2014
https://dl.acm.org/doi/10.14778/2732977.2733001
Farzanyar ZCercone NRokne JFaloutsos C(2013)Efficient mining of frequent itemsets in social network data based on MapReduce frameworkProceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining10.1145/2492517.2500301(1183-1188)Online publication date: 25-Aug-2013
https://dl.acm.org/doi/10.1145/2492517.2500301
Schlegel BKiefer TKissinger TLehner W(2013)pcAprioriProceedings of the 25th International Conference on Scientific and Statistical Database Management10.1145/2484838.2484879(1-12)Online publication date: 29-Jul-2013
https://dl.acm.org/doi/10.1145/2484838.2484879
Show More Cited By

View Options

View options

Media

Figures

Other

Tables

View Issue’s Table of Contents