[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ Skip to main content
Log in

PartEclat: an improved Eclat-based frequent itemset mining algorithm on spark clusters using partition technique

  • Published:
Cluster Computing Aims and scope Submit manuscript

Abstract

Frequent itemset mining (FIM) is one of the prominent techniques to extract knowledge from transactional databases. Finding frequent itemsets becomes tiresome while dealing with large-scale datasets due to the enormous demand for computational resources. Cluster-based versions of FIM algorithms become a natural choice to make FIM algorithms efficient and scalable for large-scale datasets and to meet the ever-growing data needs. A variety of MapReduce-based algorithms on the Hadoop cluster has been developed to meet the expectations. Due to the iterative nature of the FIM algorithms, they are still unable to perform adequately. Bottlenecks associated with MapReduce-based FIMs include challenges originating from adapting FIM algorithms in parallel and distributed contexts, as well as reading and publishing intermediate results to HDFS, which necessitates significant disc and communication I/O. Many FIM algorithms have been redesigned on Spark to achieve efficiency for large-scale datasets by utilizing its in-memory processing capabilities. However, Spark-based FIM adaptations still face the same challenges except achieving memory efficiency. The limitations associated with basic FIM algorithms, such as repeated scanning of input data, massive candidate generation, and maintaining large conditional trees in memory, still need to be addressed. Also, tuning of Spark’s shuffle behavior becomes important to control the communication overhead. In this paper, we propose a Spark-based algorithm, namely PartEclat. It uses the Eclat method in combination with the partitioning technique to gain efficiency and scalability. Vertical data format helps to avoid repeated scanning of input data to calculate individual support values. It utilizes the benefits of the partitioning technique to limit the movement of key-value pairs across the cluster nodes (shuffle overheads) during iterations. Also, it helps to deal efficiently with the memory requirements to handle large Transaction ID (TID) sets and computational overhead imposed in computing the intersection of these large TID sets. In-depth experiments with benchmark datasets have been conducted to gain insight into the efficiency and scalability performance of the algorithm. Experimental results show that the proposed scheme outperforms other Spark-based Eclat variants by reducing network and computing loads by approx. 25–40%.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
£29.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price includes VAT (United Kingdom)

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5

Similar content being viewed by others

References

  1. Aggarwal, C.C., Bhuiyan, M.A., Hasan, M.A.: Frequent pattern mining algorithms A survey. Frequent pattern mining Springer, Cham (2014)

    Book  Google Scholar 

  2. Han, J., Kamber, M., Pei, J.: Data mining concepts and techniques third edition. Morgan Kaufmann Series Data Manag Sys 5(4), 83–124 (2011)

    Google Scholar 

  3. Zhao, Qiankun, and Sourav S. Bhowmick (2003) Association rule mining: A survey. Nanyang Technological University, Singapore 135

  4. Agrawal, R., Srikant, R.: VLDB 1215, 487–499 (1994)

    Google Scholar 

  5. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. ACM SIGMOD Rec. 29(2), 1–12 (2000)

    Article  Google Scholar 

  6. Zaki, M.J.: Scalable algorithms for association mining. IEEE Trans. Knowl. Data Eng. 12(3), 372–390 (2000)

    Article  Google Scholar 

  7. Fan, W., Bifet, A.: Mining big data: current status, and forecast to the future. ACM SIGKDD Explorations Newsl 14(2), 1–5 (2013)

    Article  Google Scholar 

  8. Wu, X., Zhu, X., Gong-Qing, Wu., Ding, W.: Data mining with big data. IEEE Trans. Knowl. Data Eng. 26(1), 97–107 (2013)

    Google Scholar 

  9. Zaki, M.J.: Parallel and distributed association mining: A survey. IEEE Concurr. 7(4), 14–25 (1999)

    Article  Google Scholar 

  10. Dean, J., Ghemawat, S.: MapReduce: a flexible data processing tool. Commun. ACM 53(1), 72–77 (2010)

    Article  Google Scholar 

  11. The Apache Software Foundation. “Apache Spark: Lightning-fast cluster computing”, March 2021, http://spark.apache.org/. Spark1.6.0

  12. Li, Ning, Li Zeng, Qing He, and Zhongzhi Shi. Parallel implementation of apriori algorithm based on mapreduce. In 2012 13th ACIS international conference on software engineering, artificial intelligence, networking and parallel/distributed computing, pp. 236–241. IEEE, 2012.

  13. Singh, Sudhakar, Rakhi Garg, and P. K. Mishra (2017) Review of apriori based algorithms on mapreduce framework." arXiv preprint arXiv:1702.06284

  14. Luna, J.M., Padillo, F., Pechenizkiy, M., Ventura, S.: Apriori versions based on mapreduce for mining frequent patterns on big data. IEEE Transact Cybern 48(10), 2851–2865 (2017)

    Article  Google Scholar 

  15. Raj, S., Ramesh, D., Sethi, K.K.: A Spark-based Apriori algorithm with reduced shuffle overhead. J Supercomput 77, 133–151 (2021)

    Article  Google Scholar 

  16. Castro, E.P.S., Maia, T.D., Pereira, M.R., Esmin, A.A.A., Pereira, D.A.: Review and comparison of Apriori algorithm implementations on Hadoop-MapReduce and Spark. The Knowledge Engineering Review (2018). https://doi.org/10.1017/S0269888918000127

    Article  Google Scholar 

  17. Li, Haoyuan, Yi Wang, Dong Zhang, Ming Zhang, and Edward Y. Chang (2008). Pfp: parallel fp-growth for query recommendation. In Proceedings of the 2008 ACM conference on Recommender systems, pp. 107–114

  18. Agarwal, R.C., Aggarwal, C.C., Prasad, V.V.V.: A tree projection algorithm for generation of frequent item sets. J. Parallel Distributed Comput. 61(3), 350–371 (2001)

    Article  MATH  Google Scholar 

  19. Grahne, G., Zhu, J.: Fast algorithms for frequent itemset mining using fp-trees. IEEE Trans. Knowl. Data Eng. 17(10), 1347–1362 (2005)

    Article  Google Scholar 

  20. Zhou, Le, Zhiyong Zhong, Jin Chang, Junjie Li, Joshua Zhexue Huang, and Shengzhong Feng (2010) Balanced parallel fp-growth with mapreduce. In 2010 IEEE youth conference on information, computing and telecommunications, pp. 243–246. IEEE

  21. Deng, Lingling, and Yuansheng Lou (2015) Improvement and research of FP-growth algorithm based on distributed spark." In 2015 International Conference on Cloud Computing and Big Data (CCBD), pp. 105–108. IEEE

  22. Zaki, Mohammed J., and Karam Gouda (2003) Fast vertical mining using diffsets." In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 326–335

  23. Liu, J., Yongsheng, Wu., Zhou, Q., Fung, B., Chen, F., Binxiao, Yu.: Parallel eclat for opportunistic mining of frequent itemsets. In Database and Expert Systems Applications, Springer, Cham (2015)

    Book  Google Scholar 

  24. Singh, P., Sudhakar Singh, P.K., Mishra, and Rakhi Garg,: RDD-Eclat: approaches to parallelize Eclat algorithm on spark RDD framework. In: International Conference on Computer Networks and Inventive Communication Technologies, pp. 755–768. Springer, Cham (2019)

    Google Scholar 

  25. Zhang, C., Tian, P., Zhang, X., Jiang, Z.L., Yao, L., Wang, X.: Fast eclat algorithms based on minwise hashing for large scale transactions. IEEE Internet Things J. 6(2), 3948–3961 (2018)

    Article  Google Scholar 

  26. Lin, Xueyan (2014) Mr-apriori Association rules algorithm based on mapreduce. In 2014 IEEE 5th international conference on software engineering and service science, pp. 141–144. IEEE

  27. Lin, Ming-Yen, Pei-Yu Lee, and Sue-Chen Hsueh (2012) Apriori-based frequent itemset mining algorithms on MapReduce. In Proceedings of the 6th international conference on ubiquitous information management and communication, pp. 1–8

  28. Chon, K.-W., Kim, M.-S.: BIGMiner: a fast and scalable distributed frequent pattern miner for big data. Clust. Comput. 21(3), 1507–1520 (2018)

    Article  Google Scholar 

  29. Chen, Zhuang, Shibang Cai, Qiulin Song, and Chonglai Zhu (2011) An improved Apriori algorithm based on pruning optimization and transaction reduction. In 2011 2nd International Conference on Artificial Intelligence, Management Science and Electronic Commerce (AIMSEC), pp. 1908–1911. IEEE,

  30. Xun, Y., Zhang, J., Qin, X.: Fidoop: Parallel mining of frequent itemsets using mapreduce. IEEE transac Sys, Man, and Cybernetics 46(3), 313–325 (2015)

    Article  Google Scholar 

  31. Moens, Sandy, Emin Aksehirli, and Bart Goethals (2013) Frequent itemset mining for big data. In 2013 IEEE international conference on big data, 111–118. IEEE

  32. Zhang, F., Liu, M., Gui, F., Shen, W., Shami, A., Ma, Y.: A distributed frequent itemset mining algorithm using spark for big data analytics. Clust. Comput. 18(4), 1493–1501 (2015)

    Article  Google Scholar 

  33. Qiu, Hongjian, Rong Gu, Chunfeng Yuan, and Yihua Huang (2014) Yafim: a parallel frequent itemset mining algorithm with spark. In 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, pp. 1664–1671. IEEE

  34. Rathee, Sanjay, Manohar Kaul, and Arti Kashyap (2015) R-Apriori: an efficient apriori based algorithm on spark. In Proceedings of the 8th workshop on Ph. D. Workshop in information and knowledge management, pp. 27–34

  35. Sethi, K.K., Ramesh, D.: HFIM a Spark-based hybrid frequent itemset mining algorithm for big data processing. J. Supercomput. 73(8), 3652–3668 (2017)

    Article  Google Scholar 

  36. Rathee, S., Kashyap, A.: Adaptive-Miner: an efficient distributed association rule mining algorithm on Spark. J. Big Data 5(1), 1–17 (2018)

    Article  Google Scholar 

  37. Raj, S., Dharavath Ramesh, M., Sreenu, and Krishan Kumar Sethi,: EAFIM Efficient apriori-based frequent itemset mining algorithm on Spark for big transactional data. Knowledge Informat Syst 62(9), 3565–3583 (2020)

    Article  Google Scholar 

  38. Shi, Xiujin, Shaozong Chen, and Hui Yang (2017) DFPS: Distributed FP-growth algorithm based on Spark." In 2017 IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), pp. 1725–1731. IEEE

  39. Xiao, W., Juan, Hu.: SWEclat: a frequent itemset mining algorithm over streaming data using Spark Streaming. J. Supercomput. 76(10), 7619–7634 (2020)

    Article  Google Scholar 

  40. Yang, Shaosong, Guoyan Xu, Zhijian Wang, and Fachao Zhou (2015) The parallel improved Apriori algorithm research based on Spark In 2015 Ninth International Conference on Frontier of Computer Science and Technology, pp. 354–359. IEEE

  41. Frequent Itemset Mining Dataset repository. February, 2021, http://fimi.uantwerpen.be/data/

  42. SPMF: An Open-Source Data Mining Library, February, 2021. http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php

  43. UCI Machine Learning Repository, February, 2021. https://archive.ics.uci.edu/ml/datasets.php

Download references

Acknowledgements

The authors would like to express their heartiest thanks to the Department of Computer Science & Engineering, Indian Institute of Technology (ISM), Dhanbad, India, for their research support.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Dharavath Ramesh.

Ethics declarations

Conflict of interest

The authors declare that they do not have any commercial or associative interest representing a conflict of interest connected with the work submitted.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Raj, S., Ramesh, D. PartEclat: an improved Eclat-based frequent itemset mining algorithm on spark clusters using partition technique. Cluster Comput 25, 4463–4480 (2022). https://doi.org/10.1007/s10586-022-03673-5

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10586-022-03673-5

Keywords

Navigation