[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance

Published: 01 January 2005 Publication History

Abstract

With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this paper, we focus on shared memory parallelization of data mining algorithms. We have developed a series of techniques for parallelization of data mining algorithms, including full replication, full locking, fixed locking, optimized full locking, and cache-sensitive locking. Unlike previous work on shared memory parallelization of specific data mining algorithms, all of our techniques apply to a large number of popular data mining algorithms. In addition, we propose a reduction-object-based interface for specifying a data mining algorithm. We show how our runtime system can apply any of the techniques we have developed starting from a common specification of the algorithm. We have carried out a detailed evaluation of the parallelization techniques and the programming interface. We have experimented with apriori and fp-tree-based association mining, k-means clustering, k-nearest neighbor classifier, and decision tree construction. The main results from our experiments are as follows: 1) Among full replication, optimized full locking, and cache-sensitive locking, there is no clear winner. Each of these three techniques can outperform others depending upon machine and dataset parameters. These three techniques perform significantly better than the other two techniques. 2) Good parallel efficiency is achieved for each of the four algorithms we experimented with, using our techniques and runtime system. 3) The overhead of the interface is within 10 percent in almost all cases. 4) In the case of decision tree construction, combining different techniques turned out to be crucial for achieving high performance.

References

[1]
R. Agrawal T. Imielinski and A. Swami, “Database Mining: A Performance Perspective,” IEEE Trans. Knowledge and Data Eng., vol. 5, no. 6, pp. 914-925, Dec. 1993.]]
[2]
R. Agrawal and J. Shafer, “Parallel Mining of Association Rules,” IEEE Trans. Knowledge and Data Eng., vol. 8, no. 6, pp. 962-969, June 1996.]]
[3]
R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules,” Proc. 1994 Int'l Conf. Very Large Databases (VLDB '94), pp. 487-499, Sept. 1994.]]
[4]
K. Alsabti S. Ranka and V. Singh, “Clouds: Classification for Large or Out-of-Core Datasets,” http://www.cise.ufl.edu/ranka/dm.html, 1998.]]
[5]
P. Becuzzi M. Coppola and M. Vanneschi, “Mining of Association Rules in Very Large Databases: A Structured Parallel Approach,” Proc. Europar-99, vol. 1685, pp. 1441-1450, Aug. 1999.]]
[6]
W. Blume R. Doallo R. Eigenman J. Grout J. Hoelflinger T. Lawrence J. Lee D. Padua Y. Paek B. Pottenger L. Rauchwerger and P. Tu, “Parallel Programming with Polaris,” Computer, vol. 29, no. 12, pp. 78-82, Dec. 1996.]]
[7]
R.D. Blumofe C.F. Joerg, et al., “Cilk: An Efficient Multithreaded Runtime System,” Proc. Fifth ACM Conf. Principles and Practices of Parallel Programming (PPoPP), 1995.]]
[8]
S. Brin R. Motwani J. Ullman and S. Tsur, “Dynamic Itemset Counting and Implication Rules for Market Basket Data,” Proc. ACM SIGMOD Conf. Management of Data, May 1997.]]
[9]
P. Cheeseman and J. Stutz, “Bayesian Classification (Autoclass): Theory and Practice,” Advanced in Knowledge Discovery and Data Mining, pp. 61-83, 1996.]]
[10]
I.S. Dhillon and D.S. Modha, “A Data-Clustering Algorithm on Distributed Memory Multiprocessors,” Proc. Workshop Large-Scale Parallel KDD Systems, in conjunction with the Fifth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '99), pp. 47-56, Aug. 1999.]]
[11]
R. Ferreira G. Agrawal and J. Saltz, “Compiling Object-Oriented Data Intensive Computations,” Proc. 2000 Int'l Conf. Supercomputing, May 2000.]]
[12]
G. Forman and B. Zhang, “Distributed Data Clustering Can be Efficient and Exact,” Proc. SIGKDD Explorations, vol. 2, no. 2, Dec. 2000.]]
[13]
J. Gehrke V. Ganti R. Ramakrishnan and W. Loh, “Boat-Optimistic Decision Tree Construction,” Proc. ACM SIGMOD Conf. Management of Data, June 1999.]]
[14]
J. Gehrke R. Ramakrishnan and V. Ganti, “Rainforest-A Framework for Fast Decision Tree Construction of Large Datasets,” Proc. Conf. Very Large Databases (VLDB), 1998.]]
[15]
S. Goil and A. Choudhary, “Efficient Parallel Classification Using Dimensional Aggregates,” Proc. Workshop Large-Scala Parallel KDD Systems, with ACM SIGKDD-99, Aug. 1999.]]
[16]
S. Goil and A. Choudhary, “PARSIMONY: An Infrastructure for Parallel Multidimensional Analysis and Data Mining,” J. Parallel and Distributed Computing, vol. 61, no. 3, pp. 285-321, Mar. 2001.]]
[17]
E. Gutierrez O. Plata and E.L. Zapata, “A Compiler Method for the Parallel Execution of Irregular Reductions in Scalable Shared Memory Multiprocessors,” Proc. Int'l Conf. Supercomputing (ICS '00), pp.nbsp78-87, May 2000.]]
[18]
M. Hall S. Amarsinghe B. Murphy S. Liao and M. Lam, “Maximizing Multiprocessor Performance with the SUIF Compiler,” Computer, no. 12, Dec. 1996.]]
[19]
E.-H. Han G. Karypis and V. Kumar, “Scalable Parallel Datamining for Association Rules,” Proc. ACM SIGMOD 1997, May 1997.]]
[20]
E-H. Han G. Karypis and V. Kumar, “Scalable Parallel Datamining for Association Rules,” IEEE Trans. Data and Knowledge Eng., vol. 12, no. 3, May/June 2000.]]
[21]
J. Han J. Pei and Y. Yin, “Mining Frequent Patterns without Candidate Generation,” Proc. ACM SIGMOD Conf. Management of Data, 2000.]]
[22]
J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2000.]]
[23]
J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, 2nd ed. Morgan Kaufmann, Inc., 1996.]]
[24]
IBM. Db2, “Universal Database Goes Parallel with Enterprise and Enterprise-Extended Editions,” http://www-4.ibm.com/software/data/db2/udb/98eeebrochure, 1999.]]
[25]
A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Prentice Hall, 1988.]]
[26]
R. Jin and G. Agrawal, “An Efficient Implementation of Apriori Association Mining on Cluster of SMPs,” Proc. Workshop High Performance Data Mining (IPDPS 2001), Apr. 2001.]]
[27]
R. Jin and G. Agrawal, “A Middleware for Developing Parallel Data Mining Implementations,” Proc. First SIAM Conf. Data Mining, Apr. 2001.]]
[28]
R. Jin and G. Agrawal, “Performance Prediction for Random Write Reductions: A Case Study in Modeling Shared Memory Programs,” Proc. ACM SIGMETRICS, June 2002.]]
[29]
M.V. Joshi G. Karypis and V. Kumar, “Scalparc: A New Scalable and Efficient Parallel Classification Algorithm for Mining Large Datasets,” Proc. Int'l Parallel Processing Symp., 1998.]]
[30]
A. Kagi, “Mechanisms for Efficient Shared-Memory, Lock-Based Sychronization,” PhD thesis, Univ. of Wisconsin, Madison, 1999.]]
[31]
A. Kagi D. Burger and J.R. Goodman, “Efficient Synchronization: Let Them Eat QOLB,” Proc. 24th Ann. Int'l Symp. Computer Architecture, pp. 170-180, June 1997.]]
[32]
X. Li R. Jin and G. Agrawal, “A Compilation Framework for Distributed Memory Parallelizattion of Data Mining Algorithms,” submitted for publication, 2002.]]
[33]
X. Li R. Jin and G. Agrawal, “Compiler and Runtime Support for Shared Memory Parallelization of Data Mining Algorithms,” Proc. Conf. Language and Compilers for Parallel Computing, Aug. 2002.]]
[34]
Y. Lin and D. Padua, “On the Automatic Parallelization of Sparse and Irregular Fortran Programs,” Proc. Workshop Languages, Compilers, and Runtime Systems for Scalable Computers (LCR-98), May 1998.]]
[35]
H. Lu A.L. Cox S. Dwarkadas R. Rajamony and W. Zwaenepoel, “Compiler and Software Distributed Shared Memory Support for Irregular Applications,” Proc. Sixth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPOPP), pp. 48-56, June 1997.]]
[36]
T. Mahapatra and S. Mishra, Oracle Parallel Processing. O'Reilly Publishers, 2000.]]
[37]
M. Mehta R. Agrawal and J. Rissanen, “SLIQ: A Fast Scalable Classifier for Data Mining,” Proc. Fifth Int'l Conf. Extending Database Technology, 1996.]]
[38]
A. Mueller, “Fast Sequential and Parallel Algorithms for Association Rule Mining: A Comparison,” Technical Report CS-TR-3515, Univ. of Maryland, College Park, Aug. 1995.]]
[39]
S.K. Murthy, “Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey,” Data Mining and Knowledge Discovery, vol. 2, no. 4, pp. 345-389, 1998.]]
[40]
G.J. Narlikar, “A Parallel, Multithreaded Decision Tree Builder,” Technical Report CMU-CS-98-184, School of Computer Science, Carnegie Mellon Univ., 1998.]]
[41]
C.R. Palmer and C. Faloutsos, “Density Biases Sampling: An Improved Method for Data Mining and Clustering,” Proc. 2000 ACM SIGMOD Int'l Conf. Management of Data, June 2000.]]
[42]
J.S. Park M. Chen and P.S. Yu, “An Effecitive Hash Based Algorithm for Mining Association Rules,” Proc. ACM SIGMOD Int'l Conf. Management of Data, May 1995.]]
[43]
S. Parthasarathy M. Zaki and W. Li, “Memory Placement Techniques for Parallel Association Mining,” Proc. Fourth Int'l Conf. Knowledge Discovery and Data Mining (KDD), Aug. 1998.]]
[44]
S. Parthasarathy M. Zaki M. Ogihara and W. Li, “Parallel Data Mining for Association Rules on Shared-Memory Systems,” Knowledge and Information Systems, to appear, 2000.]]
[45]
F. Provost and V. Kolluri, “A Survey of Methods for Scaling up Inductive Algorithms,” Knowledge Discovery and Data Mining, vol. 3, 1999.]]
[46]
J.R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, Calif.: Morgan Kaufmann, 1993.]]
[47]
S. Ruggieri, “Efficient C4.5,” Technical Report TR-00-01, Dept. of Information, Univ. of Pisa, Feb. 1999.]]
[48]
J.H. Saltz R. Mirchandaney and K. Crowley, “Run-Time Parallelization and Scheduling of Loops,” IEEE Trans. Computers, vol. 40, no. 5, pp. 603-612, May 1991.]]
[49]
A. Savasere E. Omiecinski and S. Navathe, “An Efficient Algorithm for Mining Association Rules in Large Databases,” Proc. 21st Conf. Very Large Databases (VLDB), 1995.]]
[50]
J. Shafer R. Agrawal and M. Mehta, “SPRINT: A Scalable Parallel Classifier for Data Mining,” Proc. 22nd Int'l Conf. Very Large Databases (VLDB), pp. 544-555, Sept. 1996.]]
[51]
A. Shatdal, “Architectural Considerations for Parallel Query Evaluation Algorithms,” Technical Report CS-TR-1996-1321, Univ. of Wisconsin, 1999.]]
[52]
D.B. Skillicorn, “Strategies for Parallel Data Mining,” IEEE Concurrency, Oct./Dec. 1999.]]
[53]
A. Srivastava E. Han V. Kumar and V. Singh, “Parallel Formulations of Decision-Tree Classification Algorithms,” Proc. 1998 Int'l Conf. Parallel Processing, 1998.]]
[54]
H. Yu and L. Rauchwerger, “Adaptive Reduction Parallelization Techniques,” Proc. 2000 Int'l Conf. Supercomputing, pp. 66-75, May 2000.]]
[55]
M.J. Zaki C.-T. Ho and R. Agrawal, “Parallel Classification for Data Mining on Shared-Memory Multiprocessors,” Proc. IEEE Int'l Conf. Data Eng., pp. 198-205, May 1999.]]
[56]
M.J. Zaki M. Ogihara S. Parthasarathy and W. Li, “Parallel Data Mining for Association Rules on Shared Memory Multiprocessors,” Proc. Conf. Supercomputing '96, Nov. 1996.]]
[57]
M.J. Zaki, “Parallel and Distributed Association Mining: A Survey,” IEEE Concurrency, vol. 7, no. 4, pp. 14-25, 1999.]]

Cited By

View all
  • (2021)Exploring Decomposition for Solving Pattern Mining ProblemsACM Transactions on Management Information Systems10.1145/343977112:2(1-36)Online publication date: 11-Feb-2021
  • (2017)Information fusion from multiple databases using meta-association rulesInternational Journal of Approximate Reasoning10.1016/j.ijar.2016.09.00680:C(185-198)Online publication date: 1-Jan-2017
  • (2016)Compressed Bitmaps Based Frequent Itemsets Mining on HadoopProceedings of the 10th International Conference on Informatics and Systems10.1145/2908446.2908457(159-165)Online publication date: 9-May-2016
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image IEEE Transactions on Knowledge and Data Engineering
IEEE Transactions on Knowledge and Data Engineering  Volume 17, Issue 1
January 2005
143 pages

Publisher

IEEE Educational Activities Department

United States

Publication History

Published: 01 January 2005

Author Tags

  1. 65
  2. Index Terms- Shared memory parallelization
  3. association mining
  4. clustering
  5. decision tree construction.
  6. programming interfaces

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 18 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2021)Exploring Decomposition for Solving Pattern Mining ProblemsACM Transactions on Management Information Systems10.1145/343977112:2(1-36)Online publication date: 11-Feb-2021
  • (2017)Information fusion from multiple databases using meta-association rulesInternational Journal of Approximate Reasoning10.1016/j.ijar.2016.09.00680:C(185-198)Online publication date: 1-Jan-2017
  • (2016)Compressed Bitmaps Based Frequent Itemsets Mining on HadoopProceedings of the 10th International Conference on Informatics and Systems10.1145/2908446.2908457(159-165)Online publication date: 9-May-2016
  • (2016)Meta-association rules for mining interesting associations in multiple datasetsApplied Soft Computing10.1016/j.asoc.2016.08.01449:C(212-223)Online publication date: 1-Dec-2016
  • (2015)A hybrid construction of a decision tree for multimedia contentsMultimedia Tools and Applications10.1007/s11042-013-1614-674:19(8455-8465)Online publication date: 1-Oct-2015
  • (2015)Parallel data intensive applications using MapReduceCluster Computing10.1007/s10586-014-0405-918:1(403-418)Online publication date: 1-Mar-2015
  • (2014)Fast decision tree-based method to index large DNA-protein sequence databases using hybrid distributed-shared memory programming modelInternational Journal of Bioinformatics Research and Applications10.1504/IJBRA.2014.06076510:3(321-340)Online publication date: 1-Apr-2014
  • (2014)DimmWittedProceedings of the VLDB Endowment10.14778/2732977.27330017:12(1283-1294)Online publication date: 1-Aug-2014
  • (2013)Efficient mining of frequent itemsets in social network data based on MapReduce frameworkProceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining10.1145/2492517.2500301(1183-1188)Online publication date: 25-Aug-2013
  • (2013)pcAprioriProceedings of the 25th International Conference on Scientific and Statistical Database Management10.1145/2484838.2484879(1-12)Online publication date: 29-Jul-2013
  • Show More Cited By

View Options

View options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media