A comparative study of Distributed Large Scale Data Mining Algorithms
Isha Sood1, Dr. Varsha Sharma2
1 School of Information Technology, RGPV Bhopal (India)
2 Assistant Professor, School of Information Technology, RGPV Bhopal (India)
Email: ishasweet1984@gmail.com, varsha.sharma@rgtu.net
ABSTRACT
Essentially, data mining concerns the computation over data and the identification of patterns and trends in the information so that we can make decisions or judgements. Data mining concepts have been in use for years, but with the emergence of big data they are even more common. In particular, the scalable mining of such large data sets is a difficult problem that has attracted several recent research efforts. A few of these recent works use the MapReduce methodology to construct data mining models across the data set. In this article, we examine current approaches to large-scale data mining and compare their output to the MapReduce model. Based on our research, a system for data mining that combines MapReduce and sampling is implemented and discussed.
Keywords: MapReduce, large-scale data mining
INTRODUCTION
Big data caused an explosion in the use of more extensive data mining techniques, partially because the size of the information is much larger and because the information tends to be more varied and extensive in its very nature and content. With large data sets, it is no longer enough to get relatively simple and straightforward statistics out of the system. With 30 or 40 million records of detailed customer information, knowing that two million of them live in one location is not enough. We want to know whether those two million are in a particular age group and what their average earnings are, so that we can target customer needs better.
These business-driven needs changed simple data retrieval and statistics into more complex data mining. The business problem drives an examination of the data that helps to build a model to describe the information, which ultimately leads to the creation of the resulting report. Figure 1 outlines this process.
Figure 1. Outline of the process
The process of data analysis, discovery, and model-building is often iterative as you target and identify the different information that you can extract. You must also understand how to relate, map, associate, and cluster it with other data to produce the result. Identifying the source data and formats, and then mapping that information to the given result, can change after you discover different elements and aspects of the data.
I. Tools of Data Mining
Consequently, large-scale data mining has attracted tremendous interest in the data mining community [7, 14, 9]. Based on this research, three alternatives have emerged for dealing with massive data sets; these are examined in the models and experiments later in this paper. The mining itself relies on a handful of core techniques:
· Association: Building association or relation-based data mining can be achieved with different tools. For example, within InfoSphere Warehouse a wizard provides the configuration of an information flow for association mining by examining our database input source, decision basis, and output information.
· Clustering: This method divides the objects into clusters, treating each data tuple as an object. In the example, clustering allows us to identify two distinct groups: one around the US$2,000/20-30 age group, and another at the US$7,000-8,000/50-65 age group. In this case, we have both hypothesized and confirmed our hypothesis with a simple graph that we can create using any suitable graphing software for a quick manual view. More complex determinations require a full analytical package, especially if you want to automatically base decisions on nearest neighbour information.
Figure 2: Clustering Algorithm
· Classification: This technique assigns each object to one of a set of predefined groups. We can use classification to build up an idea of the type of customer, item, or object by describing multiple attributes that identify a particular class. For example, you can easily classify cars into different types (Benz, 4x4, convertible) by identifying different attributes (number of seats, car shape, driven wheels). Given a new car, you might assign it to a particular class by comparing its attributes with our known definitions. You can apply the same principles to customers, for example by classifying them by age and social group.
· Reduction: Reduction is about summarizing or quantifying the information and then outputting that information in a standardized structure that is based upon the totals, sums, statistics, or other analysis that you selected for output.
Querying this data is often complex, even when you use tools designed to do so. Within a data mining exercise, the ideal approach is to use the MapReduce phase of the data mining as part of your data preparation exercise.
For example, if you are building a data mining exercise for association or clustering, the best first stage is to build a suitable statistical model that you can use to identify and extract the necessary information. Use the MapReduce phase to extract and calculate that statistical information, then feed it into the rest of the data mining process, leading to a structure in which the MapReduce output drives the subsequent model-building stages.
This paper compares the three main large-scale data mining approaches. It is structured as follows: in section 2, an overview of related work in large-scale data mining is given. In section 3, the MapReduce framework and the data mining algorithm used in our work are detailed. In section 4, the three different approaches used in this paper's experiments for large-scale data mining are discussed. The experimental results are provided in section 5. In section 6, a new framework that combines sampling and MapReduce is explained. Finally, conclusions are drawn and future work is outlined in section 7.
In the last decades, a lot of work has been done on large-scale data mining, and many methods have been developed to address this problem. For example, the MapReduce method gives us the ability to divide the data using the Map() function. The map function is used in many places, for example at Google and in Microsoft products such as Hotmail. Nowadays researchers use it to solve data mining problems; our reference [12] presents a good algorithm for large-scale data mining. Our work focuses on the comparison of different methods of large-scale data mining, and we also identify some good data mining schemes.
Different Approaches of Data Mining
i. MapReduce
The MapReduce method is based on two functions: the map function and the reduce function. The map function processes and organizes the input data, for example by comparing and sorting records in ascending or descending order, while the Reduce() function aggregates the intermediate results, for example by counting the number and types of records. The "MapReduce System" (also called the "infrastructure" or "framework") runs the map function first, which takes the data and applies the sorting step, and then the reduce procedure aggregates that data. The system is inspired by the map and reduce functions of functional programming, and with it researchers can solve large-scale data mining problems.
MapReduce algorithms are applied in many places, such as sorting data and large-scale machine learning or data mining projects. In each case the same technique is used to address the data mining issue. As the figure shows, the map function runs first and then the reduce function starts its work.
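To make the division of labour concrete, here is a minimal in-memory imitation of the two phases in Python. It is only a sketch: the category-counting task and the names map_fn, reduce_fn and run_mapreduce are illustrative, and a real deployment would run the same logic through a framework such as Hadoop.

```python
from collections import defaultdict

def map_fn(record):
    """Map phase: emit (key, value) pairs; here, one count per category label."""
    category, _text = record
    yield (category, 1)

def reduce_fn(key, values):
    """Reduce phase: aggregate all values emitted for the same key."""
    return (key, sum(values))

def run_mapreduce(records):
    # Shuffle/sort step: group intermediate values by key before reducing.
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    return [reduce_fn(key, values) for key, values in sorted(grouped.items())]

if __name__ == "__main__":
    sample = [("Art", "museum exhibit"), ("Game", "chess puzzle"), ("Art", "oil painting")]
    print(run_mapreduce(sample))  # [('Art', 2), ('Game', 1)]
```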
Figure 3: MapReduce structure
ii. Support Vector Machine
Support vector machines (SVM) are supervised
learning models with associated learning algorithms that analyse
data and recognize patterns, used for classification and regression analysis. The basic SVM takes a set of input data and predicts, for each given
input, which of two possible classes forms the output, making it a non-probabilistic binary linear
classifier. Given a set of training examples, each marked as belonging to one
of two categories, an SVM training algorithm builds a model that assigns new
examples into one category or the other. An SVM model is a representation of
the examples as points in space, mapped so that the examples of the separate
categories are divided by a clear gap that is as wide as possible. New examples
are then mapped into that same space and predicted to belong to a category
based on which side of the gap they fall on.
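As an illustration only (scikit-learn is not part of this paper's experimental setup), the snippet below fits a linear SVM on a toy two-class data set and classifies a new point by the side of the maximum-margin boundary it falls on.

```python
# A minimal sketch of binary SVM classification, assuming scikit-learn is installed.
from sklearn import svm

# Toy training examples: two numeric features per example, two class labels.
X_train = [[1.0, 2.0], [2.0, 1.5], [8.0, 9.0], [9.0, 8.5]]
y_train = ["low", "low", "high", "high"]

clf = svm.SVC(kernel="linear")   # linear kernel: separating hyperplane with maximum margin
clf.fit(X_train, y_train)

# A new example is mapped into the same space and assigned a class
# according to the side of the gap it falls on.
print(clf.predict([[7.5, 8.0]]))  # ['high']
```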
iii. Naive Bayes Classifier
This method is a probabilistic approach based on random experiments. It works with a finite set of possible class labels and rests on Bayes' theorem. The approach assumes that the effect of an attribute value on a given class is independent of the values of the other attributes (class-conditional independence).
In this method, the predicted class is

v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(a_i | v_j) .............. (1)

We generally estimate P(a_i | v_j) using the following formula (the m-estimate):

P(a_i | v_j) = (n_c + m·p) / (n + m) .............. (2)

where:
n = the number of training examples for which v = v_j
n_c = the number of examples for which v = v_j and a = a_i
p = an a priori estimate for P(a_i | v_j)
m = the equivalent sample size
Like the SVM, this is a supervised classifier, but it relies on the probability theory of random experiments. For example, data about car thefts in a city can be classified with the Naive Bayes classifier.
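The sketch below, written for this article rather than taken from the original experiments, applies equations (1) and (2): it counts attribute-value frequencies per class, smooths them with the m-estimate, and predicts the class with the highest posterior score. The function names and the toy weather data are invented for illustration.

```python
from collections import Counter, defaultdict
from math import log

def train_nb(examples, m=1.0):
    """examples: list of (attribute_values, class_label) pairs."""
    class_counts = Counter(label for _, label in examples)
    attr_counts = defaultdict(lambda: defaultdict(Counter))  # [label][i][value] -> n_c
    value_sets = defaultdict(set)                            # [i] -> distinct values of attribute i
    for values, label in examples:
        for i, value in enumerate(values):
            attr_counts[label][i][value] += 1
            value_sets[i].add(value)
    return class_counts, attr_counts, value_sets, m, len(examples)

def predict(model, values):
    class_counts, attr_counts, value_sets, m, total = model
    best_label, best_score = None, float("-inf")
    for label, n in class_counts.items():
        score = log(n / total)                         # log P(v_j)
        for i, value in enumerate(values):
            n_c = attr_counts[label][i][value]
            p = 1.0 / len(value_sets[i])               # uniform prior estimate p for P(a_i | v_j)
            score += log((n_c + m * p) / (n + m))      # m-estimate, equation (2)
        if score > best_score:
            best_label, best_score = label, score      # argmax over classes, equation (1)
    return best_label

data = [(("sunny", "hot"), "no"), (("rainy", "mild"), "yes"), (("sunny", "mild"), "yes")]
print(predict(train_nb(data), ("sunny", "mild")))  # 'yes'
```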
MODELS FOR LARGE-SCALE DATA MINING
· Sampling Model: The sampling model is based on the Naive Bayes classifier. Instead of using all of the data, we divide the data into different samples and train the classifier on a sampled subset, which can then be used for relations such as association or classification.
· Ensemble Model: This is a group-based model. We divide the whole training set into several partitions, called groups, so that the same problem is distributed over numerous data sets. The idea of the ensemble model is to build an individual classifier on each group and use each classifier to classify the testing files, then use a majority vote to decide the final category of each file (a minimal sketch of this scheme is given after this list).
Figure 4: Ensemble approach
· Hadoop Model: A distributed computing model provides capacity and scalability. Hadoop provides a framework for distributed computing with which we can build a classifier on large data. The model builds a classifier on the entire training set and the classifier is evaluated against the testing set. Both the training and evaluation processes can be parallelized in the MapReduce framework.
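For concreteness, the sketch below imitates the three ways of handing the training data to a classifier. It is our own illustration: train stands for any routine that turns a list of labelled examples into a callable classifier (for instance the Naive Bayes sketch earlier), and the helper names are invented for this article.

```python
import random
from collections import Counter

def sampling_model(train, train_set, sample_size):
    """Sampling model: train one classifier on a random sample of the training set."""
    return train(random.sample(train_set, min(sample_size, len(train_set))))

def ensemble_model(train, train_set, n_groups):
    """Ensemble model: split the training set into n_groups partitions,
    train one classifier per partition, and classify by majority vote."""
    random.shuffle(train_set)
    partitions = [train_set[i::n_groups] for i in range(n_groups)]
    classifiers = [train(part) for part in partitions]

    def vote(example):
        predictions = [clf(example) for clf in classifiers]
        return Counter(predictions).most_common(1)[0][0]
    return vote

def full_model(train, train_set):
    """Full-data model: train a single classifier on the entire training set."""
    return train(train_set)
```

In this reading, the paper's Hadoop model corresponds to full_model, with the training step itself parallelized by MapReduce rather than run in a single process.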
EXPERIMENTAL RESULTS
i) Data Set
Our purpose here is to compare the results produced by the different methods on the same data set. The first problem is where to obtain the large amount of data to which these algorithms can be applied, so we use the web to solve this problem. There are many ODP (Open Directory Project) style websites where we can find a large amount of data, create the graphs, and compare the methods. We use botw.com (Best of the Web) as the source of our data. From it, we take 60,000 different items and build different types of clusters using the different data mining methods. This experiment will give a clear picture of the best method among these data mining approaches.
ii) Data Preprocessing
Before we run topic classification, all HTML pages are pre-processed by our text pre-processor. The text pre-processor cleans the data for training and classification use by:
(1) Removing all HTML tags and extracting the content from the "title" and "description" meta tags, and any meaningful content in the "body" section.
(2) Removing stop words and any symbol other than numbers or letters.
After the pre-processing step, every HTML file has been converted to a text file with a stream of terms. Since the purpose of our experiment is not to investigate the methodology of web page classification, we prefer to study the accuracy of classification on simple text files rather than on complex structured HTML files. Therefore, in the data pre-processing phase, we remove all the HTML-related features and only keep the meaningful and important information presented in the HTML pages. In the next part of the experiment, the classification task can be seen as a text categorization task.
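A minimal pre-processor along these lines might look as follows. This is a sketch written for illustration: the regular expressions, the tiny stop-word list, and the function name are stand-ins, not the actual pre-processor used in the experiments.

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}  # illustrative subset

def preprocess_html(html):
    """Reduce an HTML page to a stream of lower-case terms."""
    # Keep the text of the "description" meta tag, drop scripts and styles,
    # then strip all remaining tags so that title and body text survive.
    desc = " ".join(re.findall(r'<meta\s+name="description"\s+content="(.*?)"', html, re.I | re.S))
    text = re.sub(r"<(script|style).*?</\1>", " ", html, flags=re.I | re.S)
    text = re.sub(r"<[^>]+>", " ", text) + " " + desc
    # Keep only letters and digits, then drop stop words.
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess_html("<html><title>Used Cars</title><body>The best 4x4 cars!</body></html>"))
# ['used', 'cars', 'best', '4x4', 'cars']
```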
iii) Software for Experiment
In this experiment we use the Stata software to verify our project. It is used to track the data from botw.com, computes the probabilities produced by the Naive Bayes classifier, finds the accuracy, and creates a diagram for comparing the different types of data.
There are two different methods, shuffle and chop [28]: the first is based on random experiments, while the second is based on distributed probability algorithms. These two methods help to obtain a better result for this problem.
To build the structure of the Naive Bayes classifier we use Mahout or other software that can easily perform clustering.
iv) Results
The Naive Bayes classifier has been proven to be a very simple and effective text classifier. In our experiment, we deploy the Naive Bayes classifier in the three models: the Hadoop model, the sampling model and the ensemble model. We run each experiment 10 times. In each round, we randomly select 50% of the files from the whole data set as the training set and the remaining 50% of the files as the test set. The three models respectively use the training set to train their classifiers and evaluate the classifiers against the test set. Then we compute the average accuracy of each model. We use classification accuracy as our accuracy measure.
Figure 5: Sampling data mining accuracy
1. Sampling Model
In the sampling model, we randomly select data from the training set and use the sampled data set to train the Naive Bayes classifier. In order to investigate how accuracy varies with different sample sizes, in each round we build four classifiers using four different sample sizes. The four sample sizes are 1,000, 2,000, 5,000 and 10,000 files from each of the eight categories. Figure 5 shows the accuracy of the models at the four sample sizes. From Figure 5, we can see the classifier built on a sampling size of 10k has the highest accuracy in each round. The total average accuracy over the 10 rounds of the sampling model improves from 64.24% to 67% when the sampled data set increases from 1k to 10k files per category. Unsurprisingly, the result indicates that increasing the sampling size improves model accuracy.
2. Ensemble Model
In this model, we set the number of sub-models to 5, 10 and 15 and run the evaluations for every model in each round. Figure 6 shows the accuracy over the 10 rounds of the 5, 10 and 15 sub-model ensemble models. From Figure 6, we can see that the classifier using the 5 sub-model ensemble outperforms the other two classifiers, which use 10 and 15 sub-models respectively. The total average accuracy over the 10 rounds of sub-model ensembles decreases from 69.14% to 68.48% when the number of sub-models increases from 5 to 15. This shows that when each sub-model is built on more data (fewer sub-models means larger partitions), the individual classifiers are more accurate, which leads to a more accurate final result. However, the difference is slight.
Figure 6: Ensemble data mining accuracy
3. Hadoop Model
In our Hadoop cluster, the Master node and 22 Slave nodes run Ubuntu Linux. The Master is configured with an Intel(R) Pentium(R) 4 CPU at 2.80 GHz and 4 GB of memory. All the Slaves are configured with an Intel(R) Pentium(R) 4 CPU at 3.20 GHz and 4 GB of memory. For each round of classification, the model builds a classifier on the entire training set and the classifier is evaluated against the testing set. Table 2 shows the accuracy of the 10 rounds of evaluation in detail. The average accuracy of the 10 rounds of evaluation is 68.34%.
We select the best-performing classifiers in the sampling and ensemble approaches, which are the one built on a sampling size of 10k and the 5 sub-model ensemble classifier, to compare with the Hadoop model. In Figure 7, we can see that the ensemble approach overall has the highest accuracy among the three approaches. However, the accuracies of the three approaches are quite close.
Figure 7: Data mining accuracy of the three approaches & MapReduce classifier training and evaluation procedure
Table 1: Numbers of HTML pages from eight categories.
Category | Art | Business | Computer | Game | Health | Home | Science | Society
Number | 176,340 | 188,100 | 88,830 | 39,560 | 43,680 | 22,281 | 85,197 | 81,620
Table 2: Data mining accuracy of the Hadoop approach.
Category | Round1 | Round2 | Round3 | Round4 | Round5 | Round6 | Round7 | Round8 | Round9 | Round10
Art | 80.09% | 80.12% | 81.18% | 80.31% | 80.20% | 80.17% | 80.86% | 79.70% | 79.79% | 80.06%
Business | 55.82% | 58.43% | 53.77% | 57.30% | 57.13% | 59.62% | 54.98% | 58.36% | 59.15% | 57.98%
Computer | 82.37% | 81.93% | 82.88% | 82.68% | 82.48% | 82.18% | 82.81% | 82.22% | 81.69% | 82.14%
Game | 78.83% | 78.84% | 76.86% | 77.70% | 78.45% | 78.94% | 78.17% | 79.78% | 78.49% | 79.02%
Health | 78.77% | 80.49% | 79.68% | 80.42% | 79.95% | 80.33% | 79.76% | 79.71% | 80.14% | 81.27%
Home | 68.01% | 66.78% | 67.97% | 66.41% | 67.16% | 67.12% | 67.49% | 67.13% | 66.16% | 67.44%
Science | 48.47% | 50.64% | 49.98% | 50.19% | 49.26% | 47.89% | 48.34% | 49.32% | 49.62% | 48.53%
Society | 63.07% | 61.00% | 61.10% | 61.62% | 61.77% | 61.50% | 61.92% | 62.16% | 62.16% | 61.39%
The experiments show that the classifiers built on the entire data set have better accuracy than those built on sampled data sets. It is not surprising that using the entire data set to build the model achieves a better outcome than sampling. Undoubtedly, the MapReduce framework grants researchers a simple option to implement parallel and distributed data mining algorithms on large-scale data sets. However, some data mining techniques, such as neural networks, may not be easy to implement in the MapReduce framework. At the same time, by increasing the sampling ratio we achieve similar accuracy. However, this observation has some limitations; for example, for other classification algorithms it is not clear whether we would observe similar results.
Choosing an appropriate sampling method and size is a key factor that determines the quality of the classifier in the sampling approach. Any poor-quality sample will generate a poor classifier. A good sampling method should produce samples that are representative of the entire data set. Another important issue is the sample size. For Naive Bayes, using the Hoeffding inequality [22], we can easily compute the sample size needed to accurately estimate f(w_i, C_j). Basically, with a sample size of fifty thousand, the error in the frequency estimate will be less than 0.007. Of course, doing such an analysis for other data mining tasks could be harder.
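As a back-of-the-envelope check of that figure (our own computation, not taken from [22]), Hoeffding's inequality P(|p̂ − p| ≥ ε) ≤ 2·exp(−2nε²) can be inverted to give the error ε achievable with n samples at confidence 1 − δ; the choice δ = 0.05 below is ours.

```python
from math import log, sqrt

def hoeffding_error(n, delta=0.05):
    """Smallest eps such that P(|estimate - true frequency| >= eps) <= delta."""
    return sqrt(log(2.0 / delta) / (2.0 * n))

print(round(hoeffding_error(50_000), 4))  # ~0.0061
```

With fifty thousand samples the bound is about 0.006, consistent with the 0.007 quoted above.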
In our research work, we realized that it takes a lot of effort to implement every single data mining algorithm in MapReduce fashion, and some algorithms, such as neural networks, may be very difficult to implement in a distributed fashion. Meanwhile, as our experiments indicate, the sampling approach can yield acceptable data mining accuracy. To take full advantage of MapReduce and to reduce implementation time, we propose to utilize MapReduce in the sampling phase. In this way, we only need to implement the sampling algorithm using MapReduce, and we fully use the scalability of MapReduce to efficiently explore the entire data set in the sampling phase, then feed the sampled data to a data mining tool, such as WEKA [34], to build the data mining model. To obtain a good data mining model, we could design an important feedback loop as shown in Figure 6.
The main idea of this framework is to utilize the MapReduce technique to continuously adjust the sampled data and the sampling size until we produce a good data mining model. As Figure 6 shows, the system starts by setting the sampling size threshold that the existing in-memory data mining toolkits can handle and the sampling iteration limit, and then enters the self-adaptive feedback sampling loop. In this loop, the system uses the entire data set for sampling purposes. The MapReduce component takes the entire data set as input, and the map function serves as the sampling function, which either selects an instance into the training set or discards it according to the particular sampling method the system follows.
The reduce function outputs all selected instances as the training set. Whenever a sampled data set is retrieved successfully, the system calls the data mining toolkit, e.g. WEKA, to build a model. The system tests the model with the data that did not appear in the sampled data set, and then checks whether the accuracy has improved significantly compared to the previous sample. If the model's accuracy has improved significantly, the system does another round of sampling, until there is no substantial improvement in accuracy or it reaches the sampling iteration threshold.
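The loop just described can be summarized in pseudocode. The sketch below is our reading of the framework; every name (mapreduce_sample, build_and_evaluate) is a placeholder for a component that would have to be implemented, and the doubling of the sample size is an assumed adjustment policy.

```python
def adaptive_sampling_loop(data_set, mapreduce_sample, build_and_evaluate,
                           max_iterations, initial_size, min_improvement=0.005):
    """Self-adaptive feedback sampling loop: MapReduce draws the sample,
    an in-memory toolkit (e.g. WEKA) builds and tests the model."""
    best_model, best_accuracy = None, 0.0
    sample_size = initial_size             # largest size the in-memory toolkit can take at first
    for _ in range(max_iterations):        # sampling iteration threshold
        # Map: keep or discard each instance according to the sampling method;
        # Reduce: emit all selected instances as the training set.
        training_set, held_out = mapreduce_sample(data_set, sample_size)
        model, accuracy = build_and_evaluate(training_set, held_out)
        if accuracy - best_accuracy < min_improvement:
            break                          # no substantial improvement: stop sampling
        best_model, best_accuracy = model, accuracy
        sample_size *= 2                   # adjust the sampled data and its size (assumed policy)
    return best_model, best_accuracy
```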
The system has the following advantages.
CONCLUSIONS AND FUTURE WORK
In this paper, we explored three large-scale data mining approaches using a real-world large-scale data set. We used the same classification algorithm, Naive Bayes, with different methods of handling the training set, and compared the accuracy of the three approaches in a series of experiments. The results of our experiments show that using more data in training builds a more accurate model. As the sampling size increases, the sampling model can also reach accuracy similar to the models built on the entire data set. Furthermore, we explored the relationship between model accuracy and sampling size in the sampling model, and between model accuracy and data partitioning size in the ensemble model.
Based on our observations, we proposed an idea for utilizing the MapReduce framework to help improve the accuracy and efficiency of sampling. In the future, we plan to implement and explore our new framework by integrating it with different data mining toolkits and evaluating it with different data sets, and then compare it with existing large-scale data mining approaches.
REFERENCES
1. J. Bacardit and X. Llorà. Large scale data mining using genetics-based machine learning. In GECCO '09: Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference, pages 3381-3412, New York, NY, USA, 2009. ACM.
2. H. Cao, D. Jiang, J. Pei, Q. He, Z. Liao, E. Chen, and H. Li. Context-aware query suggestion by mining click-through and session data. In KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 875-883, New York, NY, USA, 2008. ACM.
4. E. Y. Chang, H. Bai, and K. Zhu. Parallel algorithms for mining large-scale rich-media data. In MM '09: Proceedings of the Seventeenth ACM International Conference on Multimedia, pages 917-918, New York, NY, USA, 2009. ACM.
5. F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: a distributed storage system for structured data. In OSDI '06: Proceedings of the 7th Symposium on Operating Systems Design and Implementation, pages 205-218, Berkeley, CA, USA, 2006. USENIX Association.
6. N. V. Chawla, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. Learning ensembles from bites: A scalable and accurate approach. J. Mach. Learn. Res., 5:421-451, 2004.
7. C. T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. Pages 281-288. MIT Press, 2006.
8. A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW '07: Proceedings of the 16th International Conference on World Wide Web, pages 271-280, New York, NY, USA, 2007. ACM.
9. J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI '04: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation, pages 10-10, Berkeley, CA, USA, 2004. USENIX Association.
10. C. Domingo, R. Gavaldà, and O. Watanabe. Adaptive sampling methods for scaling up knowledge discovery algorithms. Data Min. Knowl. Discov., 6(2):131-152, 2002.
11. S. Ghemawat, H. Gobioff, and S. Leung. The Google file system. SIGOPS Oper. Syst. Rev., 37(5):29-43, 2003.
12. D. Gillick, A. Faria, and J. DeNero. MapReduce: Distributed computing for machine learning.
13. R. Grossman and Y. Gu. Data mining using high performance data clouds: experimental studies using Sector and Sphere. In KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 920-927, New York, NY, USA, 2008. ACM.
14. B. Gu, F. Hu, and H. Liu. Sampling and its application in data mining: A survey. 2000.
15. Y. Gu and R. Grossman. Sector and Sphere: The design and implementation of a high performance data cloud. Theme Issue of the Philosophical Transactions of the Royal Society, E-Science and Global E-Infrastructure, 367:2429-2455, 2009.
16. B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang. Mars: a MapReduce framework on graphics processors. In PACT '08: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pages 260-269, New York, NY, USA, 2008. ACM.
17. W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13-30, 1963.
18. M. Kearns. Efficient noise-tolerant learning from statistical queries. J. ACM, 45(6):983-1006, 1998.
19. J. Lee and S. Cha. Page-based anomaly detection in large scale web clusters using adaptive MapReduce (extended abstract). In RAID '08: Proceedings of the 11th International Symposium on Recent Advances in Intrusion Detection, pages 404-405, Berlin, Heidelberg, 2008. Springer-Verlag.
20. A. McCallum and K. Nigam. A comparison of event models for Naive Bayes text classification, 1998.
21. M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In EDBT '96: Proceedings of the 5th International Conference on Extending Database Technology, pages 18-32, London, UK, 1996. Springer-Verlag.
22. T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.
23. C. Moretti, K. Steinhaeuser, D. Thain, and N. V. Chawla. Scaling up classifiers to cloud computers. In ICDM '08: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages 472-481, Washington, DC, USA, 2008. IEEE Computer Society.
24. B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo. PLANET: Massively parallel learning of tree ensembles with MapReduce. Lyon, France, 2009. ACM Press.
25. S. Papadimitriou and J. Sun. DisCo: Distributed co-clustering with map-reduce: A case study towards petabyte-scale end-to-end mining. In ICDM '08: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages 512-521, Washington, DC, USA, 2008. IEEE Computer Society.
27. F. Provost, V. Kolluri, and U. Fayyad. A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery, 3:131-169, 1999.
28. X. Qi and B. D. Davison. Web page classification: Features and algorithms. ACM Comput. Surv., 41(2):1-31, 2009.
29. J. C. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In VLDB '96: Proceedings of the 22nd International Conference on Very Large Data Bases, pages 544-555, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc.
30. I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2nd edition, 2005.
31. H. C. Yang, A. Dasdan, R. L. Hsiao, and D. S. Parker. Map-reduce-merge: simplified relational data processing on large clusters. In SIGMOD '07: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pages 1029-1040, New York, NY, USA, 2007. ACM.