计算机科学 ›› 2018, Vol. 45 ›› Issue (11A): 409-411.
丁勇, 朱长水, 武玉艳
DING Yong, ZHU Chang-shui, WU Yu-yan
摘要: 传统的并行关联规则算法对每一次迭代都定义一个MapReduce任务,以实现候选项集的生成和计数功能,但多次启动MapReduce任务会带来极大的性能开销。文中定义了一种并行关联规则挖掘算法PST-Apriori,该算法采取分治策略,在每个分布式计算节点定义一个前缀共享树,通过递归调用的方式将事务T生成的候选项集逐层压缩到前缀共享树(PST)中。然后广度遍历PST,逐层将每个节点对应的〈key,value〉作为map函数的输入,并由Map-Reduce框架自动按照key值进行聚集。最后调用reduce函数对多个任务的处理结果进行汇总,得到满足最小支持度阈值的频繁项集。算法只使用两个MapReduce任务,且PST按照key值排序便于Mapper端的shuffle操作,提高了运行效率。
中图分类号:
[1]AGRAWAL R,SRIKANT R.Fast algorithms for mining associa-tion rules(3rd ed)[M]∥Readings in Database Systems.Morgan Kaufmann Publishers Inc.,1998:2299-2308. [2]HAN J,PEI J,YIN Y.Mining frequent patterns without candidate generation[C]∥ACM SIGMOD International Conference on Management of Data.ACM,2000:1-12. [3]LI L,ZHANG M.The Strategy of Mining Association Rule Based on Cloud Computing[C]∥International Conference on Business Computing and Global Informatization.IEEE,2011:475-478. [4]LI N,ZENG L,HE Q,et al.Parallel Implementation of Apriori Algorithm Based on MapReduce[C]∥Acis International Conference on Software Engineering,Artificial Intelligence,Networking and Parallel & Distributed Computing.IEEE,2012:236-241. [5]ZHOU X,HUANG Y.An Improved Parallel Association Rules Algorithm Based on MapReduce Framework for Big Data[C]∥International Conference on Fuzzy Systems and Knowledge Discovery.IEEE,2014:284-288. [6]郝天曙.基于Hadoop的并行数据挖掘的研究[D].南京:南京邮电大学,2017. [7]张玲.基于Hadoop平台并行关联规则挖掘算法研究[D].西安:西安科技大学,2017. [8]荀亚玲.集群环境下的关联规则挖掘及应用[D].太原:太原科技大学,2017. [9]于跃.基于Hadoop平台的并行化分布式关联规则挖掘算法研究[D].吉林:吉林大学,2017. [10]李若晨.基于并行的Apriori数据挖掘算法的研究[D].吉林:吉林大学,2017. [11]叶璐.基于Spark的改进关联规则算法研究[D].太原:太原科技大学,2017. [12]马连灯.基于HADOOP平台的并行关联规则算法研究[D].天津:天津工业大学,2017. |
[1] | 刘卫明, 安冉, 毛伊敏. 基于聚类和WOA的并行支持向量机算法 Parallel Support Vector Machine Algorithm Based on Clustering and WOA 计算机科学, 2022, 49(7): 64-72. https://doi.org/10.11896/jsjkx.210500040 |
[2] | 曹扬晨, 朱国胜, 孙文和, 吴善超. 未知网络攻击识别关键技术研究 Study on Key Technologies of Unknown Network Attack Identification 计算机科学, 2022, 49(6A): 581-587. https://doi.org/10.11896/jsjkx.210400044 |
[3] | 田冰川, 田臣, 周宇航, 陈贵海, 窦万春. 减少Hadoop集群中网络队头阻塞的调度算法 Reducing Head-of-Line Blocking on Network in Hadoop Clusters 计算机科学, 2022, 49(3): 11-22. https://doi.org/10.11896/jsjkx.210900117 |
[4] | 徐慧慧, 晏华. 基于相对危险度的儿童先心病风险因素分析算法 Relative Risk Degree Based Risk Factor Analysis Algorithm for Congenital Heart Disease in Children 计算机科学, 2021, 48(6): 210-214. https://doi.org/10.11896/jsjkx.200500082 |
[5] | 沈夏炯, 杨继勇, 张磊. 基于不相关属性集合的属性探索算法 Attribute Exploration Algorithm Based on Unrelated Attribute Set 计算机科学, 2021, 48(4): 54-62. https://doi.org/10.11896/jsjkx.200800082 |
[6] | 张元鸣, 虞家睿, 蒋建波, 陆佳炜, 肖刚. 面向MapReduce的中间数据传输流水线优化机制 Intermediate Data Transmission Pipeline Optimization Mechanism for MapReduce Framework 计算机科学, 2021, 48(2): 41-46. https://doi.org/10.11896/jsjkx.191000103 |
[7] | 张素梅, 张波涛. 一种基于量子耗散粒子群的评估模型构建方法 Evaluation Model Construction Method Based on Quantum Dissipative Particle Swarm Optimization 计算机科学, 2020, 47(6A): 84-88. https://doi.org/10.11896/JsJkx.190900148 |
[8] | 陈孟辉, 曹黔峰, 兰彦琦. 基于区块挖掘与重组的启发式算法求解置换流水车间调度问题 Heuristic Algorithm Based on Block Mining and Recombination for Permutation Flow-shop Scheduling Problem 计算机科学, 2020, 47(6A): 108-113. https://doi.org/10.11896/JsJkx.190300151 |
[9] | 崔巍, 贾晓琳, 樊帅帅, 朱晓燕. 一种新的不均衡关联分类算法 New Associative Classification Algorithm for Imbalanced Data 计算机科学, 2020, 47(6A): 488-493. https://doi.org/10.11896/JsJkx.190600132 |
[10] | 王青松, 姜富山, 李菲. 大数据环境下基于关联规则的多标签学习算法 Multi-label Learning Algorithm Based on Association Rules in Big Data Environment 计算机科学, 2020, 47(5): 90-95. https://doi.org/10.11896/jsjkx.190300150 |
[11] | 朱岸青, 李帅, 唐晓东. Spark平台中的并行化FP_growth关联规则挖掘方法 Parallel FP_growth Association Rules Mining Method on Spark Platform 计算机科学, 2020, 47(12): 139-143. https://doi.org/10.11896/jsjkx.191000110 |
[12] | 王童, 马文平, 罗维. 基于区块链的信息共享及安全多方计算模型 Information Sharing and Secure Multi-party Computing Model Based on Blockchain 计算机科学, 2019, 46(9): 162-168. https://doi.org/10.11896/j.issn.1002-137X.2019.09.023 |
[13] | 张蕾,蔡明. 基于主题融合和关联规则挖掘的图像标注 Image Annotation Based on Topic Fusion and Frequent Patterns Mining 计算机科学, 2019, 46(7): 246-251. https://doi.org/10.11896/j.issn.1002-137X.2019.07.037 |
[14] | 张维国. 面向知识推荐服务的选课决策 Decision Making of Course Selection Oriented by Knowledge Recommendation Service 计算机科学, 2019, 46(6A): 507-510. |
[15] | 贾宁, 李瑛达. 基于智能可穿戴设备的个性化健康监管平台的构建 Construction of Personalized Health Monitoring Platform Based on Intelligent Wearable Device 计算机科学, 2019, 46(6A): 566-570. |
|