计算机科学 ›› 2021, Vol. 48 ›› Issue (2): 41-46.doi: 10.11896/jsjkx.191000103
张元鸣, 虞家睿, 蒋建波, 陆佳炜, 肖刚
ZHANG Yuan-ming, YU Jia-rui, JIANG Jian-bo, LU Jia-wei, XIAO Gang
摘要: MapReduce是一种适用于大数据处理的重要并行计算框架,通过在大量集群节点上并行执行多个任务,极大地提高了数据的处理性能。然而,由于中间数据需要等到Mapper任务完成之后才能被发送给Reducer任务,由此导致的大量传输延迟成为MapReduce框架性能的重要瓶颈。为此,文中提出了一种面向MapReduce的中间数据传输流水线优化机制,将有效计算与中间数据传输解耦,以流水线的方式重叠执行各个阶段,有效隐藏数据传输开销。文中还给出了中间数据传输流水线执行机制和实现策略,包括流水线划分、数据细分、数据归并和数据传输粒度等。在公开数据集上对所提中间数据传输流水线优化机制进行了评价,当Shuffle数据量较大时,该优化机制比默认框架的整体性能提高了60.2%。
中图分类号:
[1] DEAN J,GHEMAWAT S.MapReduce:simplified data pro-cessing on large clusters[J].Communications of the ACM,2008,51(1):107-113. [2] WANG S,WANG H J,QIN X P.Architecting Big Data:Challenges,Studies and Forecasts[J].Chinese Journal of Compu-ters,2011,34(10):1741-1752. [3] ZAHARIA M,CHOWDHURY M,FRANKLIN M,et al.Spark:cluster computing with working sets[C]//Usenix Conference on Hot Topics in Cloud Computing.Berkeley:USENIX Association,2010:1-12. [4] XUN Y L,ZHANG J F,QIN X.Data Placement Strategy for MapReduce Cluster Environment[J].Jounary of Software,2015(8):2056-2073. [5] AHMAD F,LEE S,THOTTETHODI M,et al.MapReducewith communication overlap (MaRCO)[J].Journal of Parallel &Distributed Computing,2013,73(5):608-620. [6] YU W,WANG Y,QUE X.Design and Evaluation of Network-Levitated Merge for Hadoop Acceleration[J].IEEE Transactions on Parallel & Distributed Systems,2014,25(3):602-611. [7] CHOWDHURY M,ZAHARIA M,MA J,et al.Managing data transfers in computer clusters with orchestra[J].ACM SIGCOMM Computer Communication Review,2011,41(4):98-109. [8] LI J J,WU J,YANG X L,et al.Optimizing MapReduce Based on Locality of K-V Pairs and Overlap between Shuffle and Local Reduce[C]//IEEE International Conference on Parallel Processing.2015:939-948. [9] RAHMAN M W,ISLAM N S,LU X,et al.A Comprehensive Study of MapReduce Over Lustre for Intermediate Data Placement and Shuffle Strategies on HPC Clusters[J].IEEE Transactions on Parallel & Distributed Systems,2017,PP(99):1-1. [10] ARSLAN E,SHEKHAR M,KOSAR T.Locality and Network-Aware Reduce Task Scheduling for Data-Intensive Applications[C]//International Workshop on Data-intensive Computing in the Clouds.IEEE,2014. [11] CAO Y,WANG H.Communication optimization for interme-diate data of MapReduce computing model[J].Jounary of Computer Applications,2018,38(4):1078-1083. [12] LIU S,WANG H,LI B.Optimizing Shuffle in Wide-Area Data Analytics[C]//2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS).IEEE Computer Socie-ty,2017. [13] SEO S,JANG I,WOO K,et al.HPMR:Prefetching and Pre-Shuffling in Shared MapReduce Computation Environment[C]//Proc of the IEEE International Conference on Cluster Computing and Workshops.2009:1-8. [14] ZHANG S,HAN J,LIU Z,et al.Accelerating MapReduce with Distributed Memory Cache [C]//International Conference on Parallel & Distributed Systems.2009:472-478. [15] WANG B,JIANG J,YANG G.mpCache: Accelerating MapReduce with Hybrid Storage System on Many-Core Clusters [C]//NPC 2014.2014:220-233. [16] LI J,LIN X,CUI X,et al.Improving the Shuffle of Hadoop Map-Reduce[C]//IEEE International Conference on Cloud Computing Technology & Science.IEEE,2014. [17] QI K Y,HAN Y B,ZHAO Z F.MapReduce Intermediate Result Cache for Concurrent Data Stream Processing[J].Journal of Computer Research and Development,2013,50(1):111-121. [18] WANG J,QIU M,GUO B,et al.Phase-Reconfigurable Shuffle Optimization for Hadoop MapReduce[J].IEEE Transactions on Cloud Computing,2015,8(2):418-431. [19] TAN J,CHIN A,HU Z Z,et al.DynMR: dynamic MapReduce with ReduceTask interleaving and MapTask backfilling[C]//Proc of the Ninth European Conference on Computer Systems.ACM,2014:1-14. [20] QIN X P,WANG H J,DU X Y.Big Data Analysis-Computition and Symbiosis of RDBMS and MapReduce[J].Journal of Software,2012,23(1):32-45. [21] KAMBATLA K,KOLLIAS G,KUMAR V,et al.Trends in Big Data Analytics [J].Journal of Parallel & Distributed Computing,2014,74(7):2561-2573. [22] HO L Y,WU J J,LIU P.Optimal Algorithms for Cross-Rack Communication Optimization in MapReduce Framework [C]//IEEE International Conference on Cloud Computing.2011. [23] YAN W M,WANG W M.Data Structure[M].Beijing:Tsinghua University Press,2013:297-304. |
[1] | 刘卫明, 安冉, 毛伊敏. 基于聚类和WOA的并行支持向量机算法 Parallel Support Vector Machine Algorithm Based on Clustering and WOA 计算机科学, 2022, 49(7): 64-72. https://doi.org/10.11896/jsjkx.210500040 |
[2] | 傅思清, 黎铁军, 张建民. 面向粒子输运程序加速的体系结构设计 Architecture Design for Particle Transport Code Acceleration 计算机科学, 2022, 49(6): 81-88. https://doi.org/10.11896/jsjkx.210600179 |
[3] | 王国澎, 杨剑新, 尹飞, 蒋生健. 负载均衡的处理器运算资源分配方法 Computing Resources Allocation with Load Balance in Modern Processor 计算机科学, 2020, 47(8): 41-48. https://doi.org/10.11896/jsjkx.191000148 |
[4] | 陈晓杰,周清雷,李斌. 基于FPGA的7-Zip加密文档高能效口令恢复方法 Energy-efficient Password Recovery Method for 7-Zip Document Based on FPGA 计算机科学, 2020, 47(1): 321-328. https://doi.org/10.11896/jsjkx.190100027 |
[5] | 吴琪, 王兴伟, 黄敏. 基于SDN的OpenFlow交换机数据包流水线处理机制 OpenFlow Switch Packets Pipeline Processing Mechanism Based on SDN 计算机科学, 2018, 45(10): 295-299. https://doi.org/10.11896/j.issn.1002-137X.2018.10.055 |
[6] | 张少楠,邱柯妮,张伟功,王晶,郑佳欣,白瑞英,朱晓燕. 基于排队论的UM-BUS总线性能建模与评估 Queuing Theory-guided Performance Evaluation on Reconfigurable High-speed Device Connected Bus 计算机科学, 2017, 44(Z6): 504-509. https://doi.org/10.11896/j.issn.1002-137X.2017.6A.112 |
[7] | 刘翱,邓旭东,李维刚. 基于自适应控制参数的改进水波优化算法 Improved Water Wave Optimization Algorithm with Adaptive Control Parameters 计算机科学, 2017, 44(7): 203-209. https://doi.org/10.11896/j.issn.1002-137X.2017.07.036 |
[8] | 赵月,任永功,刘洋. 基于MapReduce的改进的Apriori算法及其应用研究 Improved Apriori Algorithm and Its Application Based on MapReduce 计算机科学, 2017, 44(6): 250-254. https://doi.org/10.11896/j.issn.1002-137X.2017.06.043 |
[9] | 都志辉,林璋熙,顾彦祺,Eric O.LEBIGOT,郭翔宇. 引力波cWB处理流水线的GPU加速 GPU Accelerated cWB Pipeline for Gravitational Waves Discovery 计算机科学, 2017, 44(10): 26-32. https://doi.org/10.11896/j.issn.1002-137X.2017.10.005 |
[10] | 张丽红,余世明. 求解置换流水线调度问题的改进萤火虫优化算法 Improved GSO Algorithm for Solving PFSP 计算机科学, 2016, 43(8): 240-243. https://doi.org/10.11896/j.issn.1002-137X.2016.08.048 |
[11] | 尹孟嘉,许先斌,熊曾刚,张 涛. GPU矩阵乘法的性能定量分析模型 Quantitative Performance Analysis Model of Matrix Multiplication Based on GPU 计算机科学, 2015, 42(12): 13-17. |
[12] | 王卓薇,程良伦,赵武清. 基于GPU的并行计算性能分析模型 Parallel Computation Performance Analysis Model Based on GPU 计算机科学, 2014, 41(1): 31-38. |
[13] | 周季华,叶春明,盛晓华. 基于智能水滴算法置换流水线调度问题的研究 Research on Permutation Flow-shop Scheduling Problem by Intelligent Water Drop Algorithm 计算机科学, 2013, 40(9): 250-253. |
[14] | 刘莹,谷文祥,李向涛. 置换流水线车间调度问题的研究 Research on Permutation Flow-shop Scheduling Problem 计算机科学, 2013, 40(11): 1-7. |
[15] | 陈德华,周蒙,孙延青,郑亮亮. MR-GSpar:一种基于MapReduce的大图稀疏化算法 MR-GSpar:A Distributed Large Graph Sparsification Algorithm Based on MapReduce 计算机科学, 2013, 40(10): 190-193. |
|