Abstract
Frequent itemset mining (FIM) is one of the prominent techniques to extract knowledge from transactional databases. Finding frequent itemsets becomes tiresome while dealing with large-scale datasets due to the enormous demand for computational resources. Cluster-based versions of FIM algorithms become a natural choice to make FIM algorithms efficient and scalable for large-scale datasets and to meet the ever-growing data needs. A variety of MapReduce-based algorithms on the Hadoop cluster has been developed to meet the expectations. Due to the iterative nature of the FIM algorithms, they are still unable to perform adequately. Bottlenecks associated with MapReduce-based FIMs include challenges originating from adapting FIM algorithms in parallel and distributed contexts, as well as reading and publishing intermediate results to HDFS, which necessitates significant disc and communication I/O. Many FIM algorithms have been redesigned on Spark to achieve efficiency for large-scale datasets by utilizing its in-memory processing capabilities. However, Spark-based FIM adaptations still face the same challenges except achieving memory efficiency. The limitations associated with basic FIM algorithms, such as repeated scanning of input data, massive candidate generation, and maintaining large conditional trees in memory, still need to be addressed. Also, tuning of Spark’s shuffle behavior becomes important to control the communication overhead. In this paper, we propose a Spark-based algorithm, namely PartEclat. It uses the Eclat method in combination with the partitioning technique to gain efficiency and scalability. Vertical data format helps to avoid repeated scanning of input data to calculate individual support values. It utilizes the benefits of the partitioning technique to limit the movement of key-value pairs across the cluster nodes (shuffle overheads) during iterations. Also, it helps to deal efficiently with the memory requirements to handle large Transaction ID (TID) sets and computational overhead imposed in computing the intersection of these large TID sets. In-depth experiments with benchmark datasets have been conducted to gain insight into the efficiency and scalability performance of the algorithm. Experimental results show that the proposed scheme outperforms other Spark-based Eclat variants by reducing network and computing loads by approx. 25–40%.
Similar content being viewed by others
References
Aggarwal, C.C., Bhuiyan, M.A., Hasan, M.A.: Frequent pattern mining algorithms A survey. Frequent pattern mining Springer, Cham (2014)
Han, J., Kamber, M., Pei, J.: Data mining concepts and techniques third edition. Morgan Kaufmann Series Data Manag Sys 5(4), 83–124 (2011)
Zhao, Qiankun, and Sourav S. Bhowmick (2003) Association rule mining: A survey. Nanyang Technological University, Singapore 135
Agrawal, R., Srikant, R.: VLDB 1215, 487–499 (1994)
Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. ACM SIGMOD Rec. 29(2), 1–12 (2000)
Zaki, M.J.: Scalable algorithms for association mining. IEEE Trans. Knowl. Data Eng. 12(3), 372–390 (2000)
Fan, W., Bifet, A.: Mining big data: current status, and forecast to the future. ACM SIGKDD Explorations Newsl 14(2), 1–5 (2013)
Wu, X., Zhu, X., Gong-Qing, Wu., Ding, W.: Data mining with big data. IEEE Trans. Knowl. Data Eng. 26(1), 97–107 (2013)
Zaki, M.J.: Parallel and distributed association mining: A survey. IEEE Concurr. 7(4), 14–25 (1999)
Dean, J., Ghemawat, S.: MapReduce: a flexible data processing tool. Commun. ACM 53(1), 72–77 (2010)
The Apache Software Foundation. “Apache Spark: Lightning-fast cluster computing”, March 2021, http://spark.apache.org/. Spark1.6.0
Li, Ning, Li Zeng, Qing He, and Zhongzhi Shi. Parallel implementation of apriori algorithm based on mapreduce. In 2012 13th ACIS international conference on software engineering, artificial intelligence, networking and parallel/distributed computing, pp. 236–241. IEEE, 2012.
Singh, Sudhakar, Rakhi Garg, and P. K. Mishra (2017) Review of apriori based algorithms on mapreduce framework." arXiv preprint arXiv:1702.06284
Luna, J.M., Padillo, F., Pechenizkiy, M., Ventura, S.: Apriori versions based on mapreduce for mining frequent patterns on big data. IEEE Transact Cybern 48(10), 2851–2865 (2017)
Raj, S., Ramesh, D., Sethi, K.K.: A Spark-based Apriori algorithm with reduced shuffle overhead. J Supercomput 77, 133–151 (2021)
Castro, E.P.S., Maia, T.D., Pereira, M.R., Esmin, A.A.A., Pereira, D.A.: Review and comparison of Apriori algorithm implementations on Hadoop-MapReduce and Spark. The Knowledge Engineering Review (2018). https://doi.org/10.1017/S0269888918000127
Li, Haoyuan, Yi Wang, Dong Zhang, Ming Zhang, and Edward Y. Chang (2008). Pfp: parallel fp-growth for query recommendation. In Proceedings of the 2008 ACM conference on Recommender systems, pp. 107–114
Agarwal, R.C., Aggarwal, C.C., Prasad, V.V.V.: A tree projection algorithm for generation of frequent item sets. J. Parallel Distributed Comput. 61(3), 350–371 (2001)
Grahne, G., Zhu, J.: Fast algorithms for frequent itemset mining using fp-trees. IEEE Trans. Knowl. Data Eng. 17(10), 1347–1362 (2005)
Zhou, Le, Zhiyong Zhong, Jin Chang, Junjie Li, Joshua Zhexue Huang, and Shengzhong Feng (2010) Balanced parallel fp-growth with mapreduce. In 2010 IEEE youth conference on information, computing and telecommunications, pp. 243–246. IEEE
Deng, Lingling, and Yuansheng Lou (2015) Improvement and research of FP-growth algorithm based on distributed spark." In 2015 International Conference on Cloud Computing and Big Data (CCBD), pp. 105–108. IEEE
Zaki, Mohammed J., and Karam Gouda (2003) Fast vertical mining using diffsets." In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 326–335
Liu, J., Yongsheng, Wu., Zhou, Q., Fung, B., Chen, F., Binxiao, Yu.: Parallel eclat for opportunistic mining of frequent itemsets. In Database and Expert Systems Applications, Springer, Cham (2015)
Singh, P., Sudhakar Singh, P.K., Mishra, and Rakhi Garg,: RDD-Eclat: approaches to parallelize Eclat algorithm on spark RDD framework. In: International Conference on Computer Networks and Inventive Communication Technologies, pp. 755–768. Springer, Cham (2019)
Zhang, C., Tian, P., Zhang, X., Jiang, Z.L., Yao, L., Wang, X.: Fast eclat algorithms based on minwise hashing for large scale transactions. IEEE Internet Things J. 6(2), 3948–3961 (2018)
Lin, Xueyan (2014) Mr-apriori Association rules algorithm based on mapreduce. In 2014 IEEE 5th international conference on software engineering and service science, pp. 141–144. IEEE
Lin, Ming-Yen, Pei-Yu Lee, and Sue-Chen Hsueh (2012) Apriori-based frequent itemset mining algorithms on MapReduce. In Proceedings of the 6th international conference on ubiquitous information management and communication, pp. 1–8
Chon, K.-W., Kim, M.-S.: BIGMiner: a fast and scalable distributed frequent pattern miner for big data. Clust. Comput. 21(3), 1507–1520 (2018)
Chen, Zhuang, Shibang Cai, Qiulin Song, and Chonglai Zhu (2011) An improved Apriori algorithm based on pruning optimization and transaction reduction. In 2011 2nd International Conference on Artificial Intelligence, Management Science and Electronic Commerce (AIMSEC), pp. 1908–1911. IEEE,
Xun, Y., Zhang, J., Qin, X.: Fidoop: Parallel mining of frequent itemsets using mapreduce. IEEE transac Sys, Man, and Cybernetics 46(3), 313–325 (2015)
Moens, Sandy, Emin Aksehirli, and Bart Goethals (2013) Frequent itemset mining for big data. In 2013 IEEE international conference on big data, 111–118. IEEE
Zhang, F., Liu, M., Gui, F., Shen, W., Shami, A., Ma, Y.: A distributed frequent itemset mining algorithm using spark for big data analytics. Clust. Comput. 18(4), 1493–1501 (2015)
Qiu, Hongjian, Rong Gu, Chunfeng Yuan, and Yihua Huang (2014) Yafim: a parallel frequent itemset mining algorithm with spark. In 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, pp. 1664–1671. IEEE
Rathee, Sanjay, Manohar Kaul, and Arti Kashyap (2015) R-Apriori: an efficient apriori based algorithm on spark. In Proceedings of the 8th workshop on Ph. D. Workshop in information and knowledge management, pp. 27–34
Sethi, K.K., Ramesh, D.: HFIM a Spark-based hybrid frequent itemset mining algorithm for big data processing. J. Supercomput. 73(8), 3652–3668 (2017)
Rathee, S., Kashyap, A.: Adaptive-Miner: an efficient distributed association rule mining algorithm on Spark. J. Big Data 5(1), 1–17 (2018)
Raj, S., Dharavath Ramesh, M., Sreenu, and Krishan Kumar Sethi,: EAFIM Efficient apriori-based frequent itemset mining algorithm on Spark for big transactional data. Knowledge Informat Syst 62(9), 3565–3583 (2020)
Shi, Xiujin, Shaozong Chen, and Hui Yang (2017) DFPS: Distributed FP-growth algorithm based on Spark." In 2017 IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), pp. 1725–1731. IEEE
Xiao, W., Juan, Hu.: SWEclat: a frequent itemset mining algorithm over streaming data using Spark Streaming. J. Supercomput. 76(10), 7619–7634 (2020)
Yang, Shaosong, Guoyan Xu, Zhijian Wang, and Fachao Zhou (2015) The parallel improved Apriori algorithm research based on Spark In 2015 Ninth International Conference on Frontier of Computer Science and Technology, pp. 354–359. IEEE
Frequent Itemset Mining Dataset repository. February, 2021, http://fimi.uantwerpen.be/data/
SPMF: An Open-Source Data Mining Library, February, 2021. http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php
UCI Machine Learning Repository, February, 2021. https://archive.ics.uci.edu/ml/datasets.php
Acknowledgements
The authors would like to express their heartiest thanks to the Department of Computer Science & Engineering, Indian Institute of Technology (ISM), Dhanbad, India, for their research support.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they do not have any commercial or associative interest representing a conflict of interest connected with the work submitted.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Raj, S., Ramesh, D. PartEclat: an improved Eclat-based frequent itemset mining algorithm on spark clusters using partition technique. Cluster Comput 25, 4463–4480 (2022). https://doi.org/10.1007/s10586-022-03673-5
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10586-022-03673-5