[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3627535.3638498acmconferencesArticle/Chapter ViewAbstractPublication PagesppoppConference Proceedingsconference-collections
research-article

GraphCube: Interconnection Hierarchy-aware Graph Processing

Published: 20 February 2024 Publication History

Abstract

Processing large-scale graphs with billions to trillions of edges requires efficiently utilizing parallel systems. However, current graph processing engines do not scale well beyond a few tens of computing nodes because they are oblivious to the communication cost variations across the interconnection hierarchy. We introduce GraphCube, a better approach to optimizing graph processing on large-scale parallel systems with complex interconnections. GraphCube features a new graph partitioning approach to achieve better load balancing and minimize communication overhead across multiple levels of the interconnection hierarchy. We evaluate GraphCube by applying it to fundamental graph operations performed on synthetic and real-world graph datasets. Our evaluation used up to 79,024 computing nodes and 1.2+ million processor cores. Our large-scale experiments show that GraphCube outperforms state-of-the-art parallel graph processing methods in throughput and scalability. Furthermore, GraphCube outperformed the top-ranked systems on the Graph 500 list.

Supplementary Material

PDF File (p160-gan-supp.pdf)
Supplemental material.

References

[1]
Zainab Abbas, Vasiliki Kalavri, Paris Carbone, and Vladimir Vlassov. 2018. Streaming graph partitioning: an experimental study. Proceedings of the VLDB Endowment 11, 11 (2018), 1590--1603.
[2]
Soramichi Akiyama. 2020. Assessing Impact of Data Partitioning for Approximate Memory in C/C++ Code. arXiv preprint arXiv:2004.01637 (2020).
[3]
Scott Beamer, Krste Asanovic, and David Patterson. 2012. Direction-optimizing breadth-first search. In SC'12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, 1--10.
[4]
Huanqi Cao, Yuanwei Wang, Haojie Wang, Heng Lin, Zixuan Ma, Wanwang Yin, and Wenguang Chen. 2022. Scaling graph traversal to 281 trillion edges with 40 million cores. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 234--245.
[5]
Donglin Chen, Jianbin Fang, Shizhao Chen, Chuanfu Xu, and Zheng Wang. 2019. Optimizing sparse matrix-vector multiplications on an armv8-based many-core architecture. International Journal of Parallel Programming 47, 3 (2019), 418--432.
[6]
Donglin Chen, Jianbin Fang, Chuanfu Xu, Shizhao Chen, and Zheng Wang. 2020. Characterizing scalability of sparse matrix-vector multiplications on phytium ft-2000+. International Journal of Parallel Programming 48, 1 (2020), 80--97.
[7]
R. Chen, J. Shi, Y. Chen, and H. Chen. 2015. Powerlyra: Differentiated graph computation and partitioning on skewed graphs. European Conference on Computer Systems (2015), 1--15.
[8]
Rong Chen, Jiaxin Shi, Yanzhe Chen, Binyu Zang, Haibing Guan, and Haibo Chen. 2018. PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs. ACM Trans. Parallel Comput. 5, 3 (2018), 13:1--13:39.
[9]
Rong Chen, Jiaxin Shi, Yanzhe Chen, Binyu Zang, Haibing Guan, and Haibo Chen. 2019. Powerlyra: Differentiated graph computation and partitioning on skewed graphs. ACM Transactions on Parallel Computing (TOPC) 5, 3 (2019), 1--39.
[10]
Yongzhi Chen and Yuefan Deng. 2009. A detailed analysis of communication load balance on BlueGene supercomputer. Computer physics communications 180, 8 (2009), 1251--1258.
[11]
William James Dally and Brian Patrick Towles. 2004. Principles and practices of interconnection networks. Elsevier.
[12]
Roshan Dathathri, Gurbinder Gill, Loc Hoang, Hoang-Vu Dang, Alex Brooks, Nikoli Dryden, Marc Snir, and Keshav Pingali. 2018. Gluon: A communication-optimizing substrate for distributed heterogeneous graph analytics. In Proceedings of the 39th ACM SIGPLAN conference on programming language design and implementation. 752--768.
[13]
Timothy A Davis. 2006. Direct methods for sparse linear systems. SIAM.
[14]
Jack Dongarra. 2020. Report on the Fujitsu Fugaku system. University of Tennessee-Knoxville Innovative Computing Laboratory, Tech. Rep. ICLUT-20-06 (2020).
[15]
Wenfei Fan, Tao He, Longbin Lai, Xue Li, Yong Li, Zhao Li, Zhengping Qian, Chao Tian, Lei Wang, Jingbo Xu, et al. 2021. GraphScope: a unified engine for big graph processing. Proceedings of the VLDB Endowment 14, 12 (2021), 2879--2892.
[16]
Wenfei Fan, Muyang Liu, Chao Tian, Ruiqi Xu, and Jingren Zhou. 2020. Incrementalization of graph partitioning algorithms. Proceedings of the VLDB Endowment 13, 8 (2020), 1261--1274.
[17]
Wenfei Fan, Ruiqi Xu, Qiang Yin, Wenyuan Yu, and Jingren Zhou. 2022. Application-driven graph partitioning. The VLDB Journal (2022), 1--24.
[18]
Jianbin Fang, Peng Zhang, Chun Huang, Tao Tang, Kai Lu, Ruibo Wang, and Zheng Wang. 2022. Programming Bare-Metal Accelerators with Heterogeneous Threading Models: A Case Study of Matrix-3000. arXiv preprint arXiv:2210.12230 (2022).
[19]
Xing Feng, Lijun Chang, Xuemin Lin, Lu Qin, and Wenjie Zhang. 2016. Computing connected components with linear communication cost in pregel-like systems. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE). IEEE, 85--96.
[20]
Zhisong Fu, Michael Personick, and Bryan Thompson. 2014. Map-Graph: A high level API for fast development of high performance graph analytics on GPUs. In Proceedings of workshop on GRAph data management experiences and systems. 1--6.
[21]
Pablo Fuentes, Mariano Benito, Enrique Vallejo, José Luis Bosque, Ramön Beivide, Andreea Anghel, Germán Rodríguez, Mitch Gusat, Cyriel Minkenberg, and Mateo Valero. 2017. A scalable synthetic traffic model of Graph500 for computer networks analysis. Concurrency and Computation: Practice and Experience 29, 24 (2017), e4231.
[22]
Xinbiao Gan, Yiming Zhang, Ruibo Wang, Tiejun Li, Tiaojie Xiao, Ruigeng Zeng, Jie Liu, and Kai Lu. 2021. TianheGraph: Customizing Graph Search for Graph500 on Tianhe Supercomputer. IEEE Transactions on Parallel and Distributed Systems (2021).
[23]
Xinbiao Gan, Yiming Zhang, Ruigeng Zeng, Jie Liu, Ruibo Wang, Tiejun Li, Li Chen, and Kai Lu. 2022. XTree: Traversal-Based Partitioning for Extreme-Scale Graph Processing on Supercomputers. In 2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 2046--2059.
[24]
Tao Gao, Yutong Lu, Baida Zhang, and Guang Suo. 2014. Using the intel many integrated core to accelerate graph traversal. The International journal of high performance computing applications 28, 3 (2014), 255--266.
[25]
Sayan Ghosh, Nathan R Tallent, and Mahantesh Halappanavar. 2021. Characterizing Performance of Graph Neighborhood Communication Patterns. IEEE Transactions on Parallel and Distributed Systems 33, 4 (2021), 915--928.
[26]
Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation. 17--30.
[27]
Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. 2014. GraphX: graph processing in a distributed dataflow framework. In Proceedings of the 11th USENIX conference on Operating Systems Design and Implementation. 599--613.
[28]
Samuel Grossman, Heiner Litz, and Christos Kozyrakis. 2018. Making pull-based graph processing performant. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 246--260.
[29]
http://graph500.org/. 2021. The Graph 500 List. https://graph500.org/ Last accessed 03 March 2022.
[30]
Eu Inc. 2022. url.eu-2015. https://law.di.unimi.it/webdata/eu-2015/ Last accessed 03 December 2022.
[31]
Twitter Inc. 2021. twitter-2010. https://law.di.unimi.it/webdata/twitter-2010/ Last accessed 03 December 2021.
[32]
Keita Iwabuchi, Hitoshi Sato, Yuichiro Yasui, Katsuki Fujisawa, and Satoshi Matsuoka. 2014. NVM-based hybrid BFS with memory efficient data structure. In 2014 IEEE International Conference on Big Data (Big Data). IEEE, 529--538.
[33]
George Karypis and Vipin Kumar. 1995. METIS-unstructured graph partitioning and sparse matrix ordering system, version 2.0. (1995).
[34]
George Karypis and Vipin Kumar. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on scientific Computing 20, 1 (1998), 359--392.
[35]
George Karypis, Kirk Schloegel, and Vipin Kumar. 1997. Parmetis: Parallel graph partitioning and sparse matrix ordering library. (1997).
[36]
John Kim, Wiliam J Dally, Steve Scott, and Dennis Abts. 2008. Technology-driven, highly-scalable dragonfly topology. In 2008 International Symposium on Computer Architecture. IEEE, 77--88.
[37]
Deyu Kong, Xike Xie, and Zhuoxu Zhang. 2022. Clustering-based Partitioning for Large Web Graphs. arXiv preprint arXiv:2201.00472 (2022).
[38]
Dongsheng Li, Yiming Zhang, Jinyan Wang, and KianLee Tan. 2019. TopoX: Topology Refactorization for Efficient Graph Partitioning and Processing. PVLDB 12, 8 (2019), 891--905.
[39]
Dongsheng Li, Yiming Zhang, Jinyan Wang, and Kian-Lee Tan. 2019. TopoX: Topology refactorization for efficient graph partitioning and processing. Proceedings of the VLDB Endowment 12, 8 (2019), 891--905.
[40]
Z Li, C Wu, and Y Li. 2021. FEP-based large-scale virtual screening for effective drug discovery against COVID-19. In Int. Conf. High Performance Computing, Networking, Storage, and Analysis.
[41]
Xiangke LIAO, Liquan XIAO, Canqun YANG, and Yutong LU. [n.d.]. MilkyWay-2 supercomputer: system and application. ([n. d.]).
[42]
Xiang-Ke Liao, Zheng-Bin Pang, Ke-Fei Wang, Yu-Tong Lu, Min Xie, Jun Xia, De-Zun Dong, and Guang Suo. 2015. High performance interconnect network for Tianhe system. Journal of Computer Science and Technology 30, 2 (2015), 259--272.
[43]
Heng Lin, Xiongchao Tang, Bowen Yu, Youwei Zhuo, Wenguang Chen, Jidong Zhai, Wanwang Yin, and Weimin Zheng. 2017. Scalable graph traversal on sunway taihulight with ten million cores. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 635--645.
[44]
Heng Lin, Xiaowei Zhu, Bowen Yu, Xiongchao Tang, Wei Xue, Wenguang Chen, Lufei Zhang, Torsten Hoefler, Xiaosong Ma, Xin Liu, et al. 2018. Shentu: processing multi-trillion edge graphs on millions of cores in seconds. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 706--716.
[45]
Yucheng Low. 2013. Graphlab: A distributed abstraction for large scale machine learning. University of California (2013).
[46]
Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: A Framework for Machine Learning in the Cloud. PVLDB 5, 8 (2012), 716--727.
[47]
Kai Lu, Yaohua Wang, Yang Guo, Chun Huang, Sheng Liu, Ruibo Wang, Jianbin Fang, Tao Tang, Zhaoyun Chen, Biwei Liu, et al. 2022. MT-3000: a heterogeneous multi-zone processor for HPC. CCF Transactions on High Performance Computing (2022), 1--15.
[48]
Meilian Lu, Zhenglin Zhang, Zhihe Qu, and Yu Kang. 2018. LPANNI: Overlapping community detection using label propagation in large-scale complex networks. IEEE Transactions on Knowledge and Data Engineering 31, 9 (2018), 1736--1749.
[49]
Yutong Lu. 2019. Paving the way for China exascale computing. CCF Transactions on High Performance Computing 1, 2 (2019), 63--72.
[50]
Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2009. Pregel: a system for large-scale graph processing. Sigmod (2009), 135--146.
[51]
Ruben Mayer and Hans-Arno Jacobsen. 2021. Hybrid edge partitioner: partitioning large power-law graphs under memory constraints. In Proceedings of the 2021 International Conference on Management of Data. 1289--1302.
[52]
Andrey Molyakov. 2019. Age of Great Chinese Dragon: Supercomputer Centers and High Performance Computing. Journal of Electrical and Electronic Engineering 7, 4 (2019), 87--94.
[53]
Masahiro Nakao, Koji Ueno, Katsuki Fujisawa, Yuetsu Kodama, and Mitsuhisa Sato. 2020. Performance Evaluation of Supercomputer Fugaku using Breadth-First Search Benchmark in Graph500. In 2020 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 408--409.
[54]
Masahiro Nakao, Koji Ueno, Katsuki Fujisawa, Yuetsu Kodama, and Mitsuhisa Sato. 2021. Performance of the supercomputer fugaku for breadth-first search in graph500 benchmark. In High Performance Computing: 36th International Conference, ISC High Performance 2021, Virtual Event, June 24--July 2, 2021, Proceedings 36. Springer, 372--390.
[55]
Mahdi Nikdast, Jiang Xu, Luan HK Duong, Xiaowen Wu, Zhehui Wang, Xuan Wang, and Zhe Wang. 2014. Fat-tree-based optical interconnection networks under crosstalk noise constraint. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 23, 1 (2014), 156--169.
[56]
Joel Nishimura and Johan Ugander. 2013. Restreaming graph partitioning: simple versatile algorithms for advanced balancing. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. 1106--1114.
[57]
Anil Pacaci and M Tamer Özsu. 2019. Experimental analysis of streaming algorithms for graph partitioning. In Proceedings of the 2019 International Conference on Management of Data. 1375--1392.
[58]
Julian Shun and Guy E Blelloch. 2013. Ligra: a lightweight graph processing framework for shared memory. In ACM Sigplan Notices, Vol. 48. ACM, 135--146.
[59]
George M Slota, Cameron Root, Karen Devine, Kamesh Madduri, and Sivasankaran Rajamanickam. 2020. Scalable, multi-constraint, complex-objective graph partitioning. IEEE Transactions on Parallel and Distributed Systems 31, 12 (2020), 2789--2801.
[60]
Hari Subramoni, Albert Mathews Augustine, Mark Arnold, Jonathan Perkins, Xiaoyi Lu, Khaled Hamidouche, and Dhabaleswar K Panda. 2016. INAM 2: InfiniBand Network Analysis and Monitoring with MPI. In International Conference on High Performance Computing. Springer, 300--320.
[61]
Toyotaro Suzumura, Koji Ueno, Hitoshi Sato, Katsuki Fujisawa, and Satoshi Matsuoka. 2011. Performance characteristics of Graph500 on large-scale distributed environment. In 2011 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 149--158.
[62]
Koji Ueno and Toyotaro Suzumura. 2012. 2d partitioning based graph search for the graph500 benchmark. In 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. IEEE, 1925--1931.
[63]
Koji Ueno and Toyotaro Suzumura. 2012. Highly scalable graph search for the graph500 benchmark. In Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing. 149--160.
[64]
Koji Ueno, Toyotaro Suzumura, Naoya Maruyama, Katsuki Fujisawa, and Satoshi Matsuoka. 2016. Extreme scale breadth-first search on supercomputers. In 2016 IEEE International Conference on Big Data (Big Data). IEEE, 1040--1047.
[65]
Carnegie Mellon University. 2021. ClueWeb12 Dataset. https://lemurproject.org/clueweb12/ Last accessed 03 December 2021.
[66]
Erik Vermij, Leandro Fiorin, Christoph Hagleitner, and Koen Bertels. 2017. Boosting the efficiency of HPCG and Graph500 with near-data processing. In 2017 46th International Conference on Parallel Processing (ICPP). IEEE, 31--40.
[67]
Ruibo Wang, Kai Lu, Juan Chen, Wenzhe Zhang, Jinwen Li, Yuan Yuan, Pingjing Lu, Libo Huang, Shengguo Li, and Xiaokang Fan. 2020. Brief introduction of TianHe exascale prototype system. Tsinghua Science and Technology 26, 3 (2020), 361--369.
[68]
Yuanwei Wang, Huanqi Cao, Zixuan Ma, Wanwang Yin, and Wenguang Chen. 2022. Scaling graph 500 SSSP to 140 trillion edges with over 40 million cores. In 2022 SC22: International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE Computer Society, 248--262.
[69]
Min Xie, Yutong Lu, Kefei Wang, Lu Liu, Hongjia Cao, et al. 2011. Tianhe-1a interconnect and message-passing services. IEEE Micro 32, 1 (2011), 8--20.
[70]
Xue-Jun Yang, Xiang-Ke Liao, Kai Lu, Qing-Feng Hu, Jun-Qiang Song, and Jin-Shu Su. 2011. The TianHe-1A supercomputer: its hardware and software. Journal of computer science and technology 26, 3 (2011), 344--351.
[71]
Yuichiro Yasui and Katsuki Fujisawa. 2015. Fast and scalable NUMA-based thread parallel breadth-first search. In 2015 International Conference on High Performance Computing & Simulation (HPCS). IEEE, 377--385.
[72]
Yuichiro Yasui, Katsuki Fujisawa, and Kazushige Goto. 2013. NUMA-optimized parallel breadth-first search on multicore single-node system. In 2013 IEEE International Conference on Big Data. IEEE, 394--402.
[73]
Andy Yoo, Edmond Chow, Keith Henderson, William McLendon, Bruce Hendrickson, and Umit Catalyurek. 2005. A scalable distributed parallel breadth-first search algorithm on BlueGene/L. In SC'05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing. IEEE, 25--25.
[74]
Jason Jongjin Yoo and Alan E Willner. 2001. A Performance and Implementation Comparison of Bidirectional and Dual Bus 2-D WDM Multiple-Plane Optical Interconnections with Row-ColumnMultihop Network Structures. Journal of lightwave technology 19, 6 (2001), 801.
[75]
Xin You, Hailong Yang, Zhongzhi Luan, Yi Liu, and Depei Qian. 2019. Performance evaluation and analysis of linear algebra kernels in the prototype tianhe-3 cluster. In Asian Conference on Supercomputing Frontiers. Springer, 86--105.
[76]
Jeffrey Young, Julian Romera, Matthias Hauck, and Holger Fröning. 2016. Optimizing communication for a 2D-partitioned scalable BFS. In 2016 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1--7.
[77]
Chenglong Zhang. 2020. A New Perspective of Graph Data and A Generic and Efficient Method for Large Scale Graph Data Traversal. arXiv preprint arXiv:2009.07463 (2020).
[78]
Hongyang Zhang, Peter Lofgren, and Ashish Goel. 2016. Approximate personalized pagerank on dynamic graphs. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 1315--1324.
[79]
Yiming Zhang, Haonan Wang, Menghan Jia, Jinyan Wang, Dong sheng Li, Guangtao Xue, and K. Tan. 2020. TopoX: Topology Refactorization for Minimizing Network Communication in Graph Computations. IEEE/ACM Transactions on Networking 28 (2020), 2768--2782.
[80]
Xiaowei Zhu, Wenguang Chen, Weimin Zheng, and Xiaosong Ma. 2016. Gemini: A Computation-Centric Distributed Graph Processing System. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2--4, 2016, Kimberly Keeton and Timothy Roscoe (Eds.). USENIX Association, 301--316. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/zhu

Cited By

View all
  • (2024)GraphService: Topology-aware Constructor for Large-scale Graph ApplicationsACM Transactions on Architecture and Code Optimization10.1145/3689341Online publication date: 17-Aug-2024
  • (2024)Doubling Graph Traversal Efficiency to 198 TeraTEPS on the Supercomputer FugakuProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00107(1-14)Online publication date: 17-Nov-2024

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PPoPP '24: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
March 2024
498 pages
ISBN:9798400704352
DOI:10.1145/3627535
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 February 2024

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. graph processing
  2. graph partitioning
  3. parallel computing
  4. vectorization
  5. Graph500

Qualifiers

  • Research-article

Funding Sources

  • Natural Science Foundation of China

Conference

PPoPP '24

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)414
  • Downloads (Last 6 weeks)41
Reflects downloads up to 03 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)GraphService: Topology-aware Constructor for Large-scale Graph ApplicationsACM Transactions on Architecture and Code Optimization10.1145/3689341Online publication date: 17-Aug-2024
  • (2024)Doubling Graph Traversal Efficiency to 198 TeraTEPS on the Supercomputer FugakuProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00107(1-14)Online publication date: 17-Nov-2024

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media