More Web Proxy on the site http://driver.im/

research-article

Load Balanced PIM-Based Graph Processing

Authors:

Yi KangAuthors Info & Claims

ACM Transactions on Design Automation of Electronic Systems, Volume 29, Issue 4

Article No.: 61, Pages 1 - 22

https://doi.org/10.1145/3659951

Published: 21 June 2024 Publication History

Abstract

Graph processing is widely used for many modern applications, such as social networks, recommendation systems, and knowledge graphs. However, processing large-scale graphs on traditional Von Neumann architectures is challenging due to the irregular graph data and memory-bound graph algorithms. Processing-in-memory (PIM) architecture has emerged as a promising approach for accelerating graph processing by enabling computation to be performed directly on memory. Despite having many processing units and high local memory bandwidth, PIM often suffers from insufficient global communication bandwidth and high synchronization overhead due to load imbalance.

This article proposes GraphB, a novel PIM-based graph processing system, to address all these issues. From the algorithm perspective, we propose a degree-aware graph partitioning algorithm that can generate balanced partitioning at a low cost. From the architecture perspective, we introduce tile buffers incorporated with an on-chip 2D-Mesh, which provides high bandwidth for inter-node data transfer. Dataflow in GraphB is designed to enable computation–communication overlap and dynamic load balancing. In a PyMTL3-based cycle-accurate simulator with five real-world graphs and three common algorithms, GraphB achieves an average 2.2× and maximum 2.8× speedup compared to the SOTA PIM-based graph processing system GraphQ.

References

[1]

Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. A scalable processing-in-memory accelerator for parallel graph processing. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA’15). 105–117.

Digital Library

[2]

Tero Aittokallio and Benno Schwikowski. 2006. Graph-based methods for analysing networks in cell biology. Briefings in Bioinformatics 7, 3 (2006), 243–255.

[3]

Scott Beamer, Krste Asanovic, and David Patterson. 2015. Locality exists in graph processing: Workload characterization on an Ivy bridge server. In 2015 IEEE International Symposium on Workload Characterization. 56–65.

Digital Library

[4]

P. Boldi and S. Vigna. 2004. The webgraph framework I: Compression techniques. In Proceedings of the 13th International Conference on World Wide Web (WWW’04). Association for Computing Machinery, New York, NY, USA, 595–602.

Digital Library

[5]

Dwaipayan Choudhury, Reet Barik, Aravind Sukumaran Rajam, Ananth Kalyanaraman, and Partha Pratim Pande. 2022. Software/hardware co-design of 3D NoC-based GPU architectures for accelerated graph computations. ACM Transactions on Design Automation of Electronic Systems 27, 6 (June 2022), 61:1–61:22.

Digital Library

[6]

Dwaipayan Choudhury, Lizhi Xiang, Aravind Rajam, Anantharaman Kalyanaraman, and Partha Pratim Pande. 2023. Accelerating graph computations on 3D NoC-enabled PIM architectures. ACM Transactions on Design Automation of Electronic Systems 28, 3 (March 2023), 30:1–30:16.

Digital Library

[7]

Hybrid Memory Cube Consortium et al. 2014. Hybrid Memory Cube Specification 2.1. Technical Report.

[8]

Guohao Dai, Tianhao Huang, Yuze Chi, Jishen Zhao, Guangyu Sun, Yongpan Liu, Yu Wang, Yuan Xie, and Huazhong Yang. 2019. GraphH: A processing-in-memory architecture for large-scale graph processing. IEEE Transactions on Computer-aided Design of Integrated Circuits and Systems 38, 4 (April 2019), 640–653.

[9]

Francois Fouss, Alain Pirotte, Jean-Michel Renders, and Marco Saerens. 2007. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Transactions on Knowledge and Data Engineering 19, 3 (March 2007), 355–369.

Digital Library

[10]

Bai Fujun, Jiang Xiping, Wang Song, Yu Bing, Tan Jie, Zuo Fengguo, Wang Chunjuan, Wang Fan, Long Xiaodong, Yu Guoqing, Fu Ni, Li Qiannan, Li Hua, Wang Kexin, Duan Huifu, Bai Liang, Jia Xuerong, Li Jin, Li Mei, Wang Zhengwen, Hu Sheng, Zhou Jun, Zhan Qiong, Sun Peng, Yang Daohong, Cheichan Kau, David Yang, Ching-Sung Ho, Sun Hongbin, Lv Hangbing, Liu Ming, Kang Yi, and Ren Qiwei. 2020. A stacked embedded DRAM array for LPDDR4/4X using hybrid bonding 3D integration with 34GB/s/1Gb 0.88pJ/b logic-to-memory interface. In 2020 IEEE International Electron Devices Meeting (IEDM’20). 6.6.1–6.6.4.

[11]

Seyed Ali Ghasemi, Belal Jahannia, and Hamed Farbeh. 2022. GraphA: An efficient ReRAM-based architecture to accelerate large scale graph processing. Journal of Systems Architecture 133 (Dec. 2022), 102755.

Digital Library

[12]

Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI’12). USENIX Association, 17–30.

Digital Library

[13]

Yong Guo, Marcin Biczak, Ana Lucia Varbanescu, Alexandru Iosup, Claudio Martella, and Theodore L. Willke. 2014. How well do graph-processing platforms perform? An empirical performance evaluation and analysis. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium. 395–404.

Digital Library

[14]

Jonathan M. D. Hill, Bill McColl, Dan C. Stefanescu, Mark W. Goudreau, Kevin Lang, Satish B. Rao, Torsten Suel, Thanasis Tsantilas, and Rob H. Bisseling. 1998. BSPlib: The BSP programming library. Parallel Computing 24, 14 (Dec. 1998), 1947–1980.

Digital Library

[15]

Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun. 2011. Efficient parallel graph exploration on multi-core CPU and GPU. In 2011 International Conference on Parallel Architectures and Compilation Techniques. 78–88.

Digital Library

[16]

S. Hynix. 2021. Hbm2e. https://product.skhynix.com/products/dram/hbm/hbm2e.go

[17]

Shunning Jiang, Peitian Pan, Yanghui Ou, and Christopher Batten. 2020. PyMTL3: A python framework for open-source hardware modeling, generation, simulation, and verification. IEEE Micro 40, 4 (July 2020), 58–66.

Digital Library

[18]

Xiping Jiang, Fengguo Zuo, Song Wang, Xiaofeng Zhou, Yubing Wang, Qi Liu, Qiwei Ren, and Ming Liu. 2022. A 1596-GB/s 48-Gb stacked embedded DRAM 384-Core SoC with hybrid bonding integration. IEEE Solid-state Circuits Letters 5 (2022), 110–113.

[19]

George Karypis and Vipin Kumar. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing 20, 1 (Jan. 1998), 359–392.

Digital Library

[20]

Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web (WWW’10). Association for Computing Machinery, New York, NY, USA, 591–600.

Digital Library

[21]

Aapo Kyrola, Guy Blelloch, and Carlos Guestrin. 2012. GraphChi: Large-scale graph computation on just a PC. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI’12). USENIX Association, 31–46.

Digital Library

[22]

Kartik Lakhotia, Rajgopal Kannan, Sourav Pati, and Viktor Prasanna. 2020. GPOP: A scalable cache- and memory-efficient framework for graph processing over parts. ACM Transactions on Parallel Computing 7, 1 (March 2020), 7:1–7:24.

Digital Library

[23]

Jure Leskovec, Lada A. Adamic, and Bernardo A. Huberman. 2007. The dynamics of viral marketing. ACM Transactions on the Web 1, 1 (May 2007), 5–es.

Digital Library

[24]

Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, and Michael W. Mahoney. 2009. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics 6, 1 (Jan. 2009), 29–123.

[25]

Shuchuan Lo and Chingching Lin. 2006. WMR—A graph-based algorithm for friend recommendation. In Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI’06). IEEE Computer Society, 121–128.

Digital Library

[26]

Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 5, 8 (April 2012), 716–727.

Digital Library

[27]

Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD’10). Association for Computing Machinery, New York, NY, USA, 135–146.

Digital Library

[28]

Julian McAuley and Jure Leskovec. 2012. Learning to discover social circles in ego networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 (NIPS’12). Curran Associates Inc., Red Hook, NY, USA, 539–547.

Digital Library

[29]

Newton, Virendra Singh, and Trevor E. Carlson. 2020. PIM-GraphSCC: PIM-based graph processing using graph’s community structures. IEEE Computer Architecture Letters 19, 2 (July 2020), 151–154.

Digital Library

[30]

Fabio Petroni, Leonardo Querzoni, Khuzaima Daudjee, Shahin Kamali, and Giorgio Iacoboni. 2015. HDRF: Stream-based partitioning for power-law graphs. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (CIKM’15). Association for Computing Machinery, New York, NY, USA, 243–252.

Digital Library

[31]

Amitabha Roy, Ivo Mihailovic, and Willy Zwaenepoel. 2013. X-Stream: Edge-centric graph processing using streaming partitions. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP’13). Association for Computing Machinery, New York, NY, USA, 472–488.

Digital Library

[32]

Kihwan Seong, Donguk Park, Gyeomje Bae, Hyunwoo Lee, Youngseob Suh, Wooseuk Oh, Hyemun Lee, Juyoung Kim, Takgun Lee, Geonhoo Mo, Sukhyun Jung, Dongcheol Choi, Byoung-Joo Yoo, Sanghune Park, Hyo-Gyuem Rhew, and Jongshin Shin. 2023. A 4nm 32Gb/s 8Tb/s/Mm Die-to-Die chiplet using NRZ single-ended transceiver with equalization schemes and training techniques. In 2023 IEEE International Solid- State Circuits Conference (ISSCC’23). IEEE, San Francisco, CA, USA, 114–116.

[33]

Isabelle Stanton and Gabriel Kliot. 2012. Streaming graph partitioning for large distributed graphs. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’12). Association for Computing Machinery, New York, NY, USA, 1222–1230.

Digital Library

[34]

Narayanan Sundaram, Nadathur Satish, Md Mostofa Ali Patwary, Subramanya R. Dulloor, Michael J. Anderson, Satya Gautam Vadlamudi, Dipankar Das, and Pradeep Dubey. 2015. GraphMat: High performance graph analytics made productive. Proceedings of the VLDB Endowment 8, 11 (July 2015), 1214–1225.

Digital Library

[35]

Lei Tang and Huan Liu. 2010. Graph mining applications to social network analysis. In Managing and Mining Graph Data, Charu C. Aggarwal and Haixun Wang (Eds.). Springer US, Boston, MA, 487–513.

[36]

Charalampos Tsourakakis, Christos Gkantsidis, Bozidar Radunovic, and Milan Vojnovic. 2014. FENNEL: Streaming graph partitioning for massive scale graphs. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining (WSDM’14). Association for Computing Machinery, New York, NY, USA, 333–342.

Digital Library

[37]

Junpeng Wang, Haitao Du, Bo Ding, Qi Xu, Song Chen, and Yi Kang. 2023. DDAM: Data distribution-aware mapping of CNNs on processing-in-memory systems. ACM Transactions on Design Automation of Electronic Systems 28, 3 (March 2023), 36:1–36:30.

Digital Library

[38]

Tianyi Wang, Yang Chen, Zengbin Zhang, Tianyin Xu, Long Jin, Pan Hui, Beixing Deng, and Xing Li. 2011. Understanding graph sampling algorithms for social network analysis. In Proceedings of the 2011 31st International Conference on Distributed Computing Systems Workshops (ICDCSW’11). IEEE Computer Society, 123–128.

Digital Library

[39]

Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, and John D. Owens. 2016. Gunrock: A high-performance graph processing library on the GPU. ACM SIGPLAN Notices 51, 8 (Feb. 2016), 11:1–11:12.

Digital Library

[40]

Joyce Jiyoung Whang, Andrew Lenharth, Inderjit S. Dhillon, and Keshav Pingali. 2015. Scalable data-driven PageRank: Algorithms, system issues, and lessons learned. In Euro-Par 2015: Parallel Processing (Lecture Notes in Computer Science), Jesper Larsson Träff, Sascha Hunold, and Francesco Versaci (Eds.). Springer, Berlin, 438–450.

[41]

Cong Xie, Ling Yan, Wu-Jun Li, and Zhihua Zhang. 2014. Distributed power-law graph computing: Theoretical and empirical analysis. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1 (NIPS’14). MIT Press, Cambridge, MA, USA, 1673–1681.

[42]

Zihui Xue, Yuedong Yang, and Radu Marculescu. 2023. SUGAR: Efficient subgraph-level training via resource-aware graph partitioning. IEEE Transactions on Computers 72, 11 (Nov. 2023), 3167–3177.

Digital Library

[43]

Mingyu Yan, Xing Hu, Shuangchen Li, Abanti Basak, Han Li, Xin Ma, Itir Akgun, Yujing Feng, Peng Gu, Lei Deng, Xiaochun Ye, Zhimin Zhang, Dongrui Fan, and Yuan Xie. 2019. Alleviating irregularity in graph analytics acceleration: A hardware/software co-design approach. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’52). Association for Computing Machinery, New York, NY, USA, 615–628.

Digital Library

[44]

Pengcheng Yao, Long Zheng, Yu Huang, Qinggang Wang, Chuangyi Gui, Zhen Zeng, Xiaofei Liao, Hai Jin, and Jingling Xue. 2022. ScalaGraph: A scalable accelerator for massively parallel graph processing. In 2022 IEEE International Symposium on High-performance Computer Architecture (HPCA’22). 199–212.

[45]

Mingxing Zhang, Youwei Zhuo, Chao Wang, Mingyu Gao, Yongwei Wu, Kang Chen, Christos Kozyrakis, and Xuehai Qian. 2018. GraphP: Reducing communication for PIM-based graph processing with efficient data partition. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA’18). 544–557.

[46]

Minxuan Zhou, Muzhou Li, Mohsen Imani, and Tajana Rosing. 2021. HyGraph: Accelerating graph processing with hybrid memory-centric computing. In 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE’21). 330–335.

[47]

Xiaowei Zhu, Wenguang Chen, Weimin Zheng, and Xiaosong Ma. 2016. Gemini: A computation-centric distributed graph processing system. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI’16). USENIX Association, 301–316.

Digital Library

[48]

Xiaowei Zhu, Wentao Han, and Wenguang Chen. 2015. GridGraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning. In 2015 USENIX Annual Technical Conference (USENIX ATC’15). 375–386.

[49]

Youwei Zhuo, Chao Wang, Mingxing Zhang, Rui Wang, Dimin Niu, Yanzhi Wang, and Xuehai Qian. 2019. GraphQ: Scalable PIM-based graph processing. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’52). Association for Computing Machinery, New York, NY, USA, 712–725.

Digital Library

Cited By

Kaur RAsad AMohammadi F(2024)A Comprehensive Review of Processing-in-Memory Architectures for Deep Neural NetworksComputers10.3390/computers1307017413:7(174)Online publication date: 16-Jul-2024
https://doi.org/10.3390/computers13070174
Zhao XGao TZhao ABi ZYan CYang FWang SZhou DZeng X(2024)ROI-HIT: Region of Interest-Driven High-Dimensional Microarchitecture Design Space ExplorationIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2024.344300643:11(4178-4189)Online publication date: 1-Nov-2024
https://dl.acm.org/doi/10.1109/TCAD.2024.3443006

Index Terms

Load Balanced PIM-Based Graph Processing
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
2. Hardware
  1. Integrated circuits
    1. 3D integrated circuits

Recommendations

GraphQ: Scalable PIM-Based Graph Processing
MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture

Processing-In-Memory (PIM) architectures based on recent technology advances (e.g., Hybrid Memory Cube) demonstrate great potential for graph processing. However, existing solutions did not address the key challenge of graph processing---irregular data ...
Accelerating Graph Computations on 3D NoC-Enabled PIM Architectures
Graph application workloads are dominated by random memory accesses with the poor locality. To tackle the irregular and sparse nature of computation, ReRAM-based Processing-in-Memory (PIM) architectures have been proposed recently. Most of these ReRAM ...
A Novel ReRAM-Based Processing-in-Memory Architecture for Graph Traversal
Special Issue on NVM and Storage

Graph algorithms such as graph traversal have been gaining ever-increasing importance in the era of big data. However, graph processing on traditional architectures issues many random and irregular memory accesses, leading to a huge number of data ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Design Automation of Electronic Systems

ACM Transactions on Design Automation of Electronic Systems Volume 29, Issue 4

July 2024

360 pages

EISSN:1557-7309

DOI:10.1145/3613660

Editor:
Jiang Hu
Texas A&M University, USA

Issue’s Table of Contents

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 21 June 2024

Online AM: 18 April 2024

Accepted: 08 April 2024

Revised: 17 March 2024

Received: 27 October 2023

Published in TODAES Volume 29, Issue 4

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

Strategic Priority Research Program of Chinese Academy of Sciences

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

2
Total Citations
View Citations
406
Total Downloads

Downloads (Last 12 months)406
Downloads (Last 6 weeks)76

Reflects downloads up to 15 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Kaur RAsad AMohammadi F(2024)A Comprehensive Review of Processing-in-Memory Architectures for Deep Neural NetworksComputers10.3390/computers1307017413:7(174)Online publication date: 16-Jul-2024
https://doi.org/10.3390/computers13070174
Zhao XGao TZhao ABi ZYan CYang FWang SZhou DZeng X(2024)ROI-HIT: Region of Interest-Driven High-Dimensional Microarchitecture Design Space ExplorationIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2024.344300643:11(4178-4189)Online publication date: 1-Nov-2024
https://dl.acm.org/doi/10.1109/TCAD.2024.3443006

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Full Text

View this article in Full Text.

Media

Figures

Other

Tables

View full text|Download PDF

View Issue’s Table of Contents