[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Load Balanced PIM-Based Graph Processing

Published: 21 June 2024 Publication History

Abstract

Graph processing is widely used for many modern applications, such as social networks, recommendation systems, and knowledge graphs. However, processing large-scale graphs on traditional Von Neumann architectures is challenging due to the irregular graph data and memory-bound graph algorithms. Processing-in-memory (PIM) architecture has emerged as a promising approach for accelerating graph processing by enabling computation to be performed directly on memory. Despite having many processing units and high local memory bandwidth, PIM often suffers from insufficient global communication bandwidth and high synchronization overhead due to load imbalance.
This article proposes GraphB, a novel PIM-based graph processing system, to address all these issues. From the algorithm perspective, we propose a degree-aware graph partitioning algorithm that can generate balanced partitioning at a low cost. From the architecture perspective, we introduce tile buffers incorporated with an on-chip 2D-Mesh, which provides high bandwidth for inter-node data transfer. Dataflow in GraphB is designed to enable computation–communication overlap and dynamic load balancing. In a PyMTL3-based cycle-accurate simulator with five real-world graphs and three common algorithms, GraphB achieves an average 2.2× and maximum 2.8× speedup compared to the SOTA PIM-based graph processing system GraphQ.

References

[1]
Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. A scalable processing-in-memory accelerator for parallel graph processing. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA’15). 105–117.
[2]
Tero Aittokallio and Benno Schwikowski. 2006. Graph-based methods for analysing networks in cell biology. Briefings in Bioinformatics 7, 3 (2006), 243–255.
[3]
Scott Beamer, Krste Asanovic, and David Patterson. 2015. Locality exists in graph processing: Workload characterization on an Ivy bridge server. In 2015 IEEE International Symposium on Workload Characterization. 56–65.
[4]
P. Boldi and S. Vigna. 2004. The webgraph framework I: Compression techniques. In Proceedings of the 13th International Conference on World Wide Web (WWW’04). Association for Computing Machinery, New York, NY, USA, 595–602.
[5]
Dwaipayan Choudhury, Reet Barik, Aravind Sukumaran Rajam, Ananth Kalyanaraman, and Partha Pratim Pande. 2022. Software/hardware co-design of 3D NoC-based GPU architectures for accelerated graph computations. ACM Transactions on Design Automation of Electronic Systems 27, 6 (June 2022), 61:1–61:22.
[6]
Dwaipayan Choudhury, Lizhi Xiang, Aravind Rajam, Anantharaman Kalyanaraman, and Partha Pratim Pande. 2023. Accelerating graph computations on 3D NoC-enabled PIM architectures. ACM Transactions on Design Automation of Electronic Systems 28, 3 (March 2023), 30:1–30:16.
[7]
Hybrid Memory Cube Consortium et al. 2014. Hybrid Memory Cube Specification 2.1. Technical Report.
[8]
Guohao Dai, Tianhao Huang, Yuze Chi, Jishen Zhao, Guangyu Sun, Yongpan Liu, Yu Wang, Yuan Xie, and Huazhong Yang. 2019. GraphH: A processing-in-memory architecture for large-scale graph processing. IEEE Transactions on Computer-aided Design of Integrated Circuits and Systems 38, 4 (April 2019), 640–653.
[9]
Francois Fouss, Alain Pirotte, Jean-Michel Renders, and Marco Saerens. 2007. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Transactions on Knowledge and Data Engineering 19, 3 (March 2007), 355–369.
[10]
Bai Fujun, Jiang Xiping, Wang Song, Yu Bing, Tan Jie, Zuo Fengguo, Wang Chunjuan, Wang Fan, Long Xiaodong, Yu Guoqing, Fu Ni, Li Qiannan, Li Hua, Wang Kexin, Duan Huifu, Bai Liang, Jia Xuerong, Li Jin, Li Mei, Wang Zhengwen, Hu Sheng, Zhou Jun, Zhan Qiong, Sun Peng, Yang Daohong, Cheichan Kau, David Yang, Ching-Sung Ho, Sun Hongbin, Lv Hangbing, Liu Ming, Kang Yi, and Ren Qiwei. 2020. A stacked embedded DRAM array for LPDDR4/4X using hybrid bonding 3D integration with 34GB/s/1Gb 0.88pJ/b logic-to-memory interface. In 2020 IEEE International Electron Devices Meeting (IEDM’20). 6.6.1–6.6.4.
[11]
Seyed Ali Ghasemi, Belal Jahannia, and Hamed Farbeh. 2022. GraphA: An efficient ReRAM-based architecture to accelerate large scale graph processing. Journal of Systems Architecture 133 (Dec. 2022), 102755.
[12]
Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI’12). USENIX Association, 17–30.
[13]
Yong Guo, Marcin Biczak, Ana Lucia Varbanescu, Alexandru Iosup, Claudio Martella, and Theodore L. Willke. 2014. How well do graph-processing platforms perform? An empirical performance evaluation and analysis. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium. 395–404.
[14]
Jonathan M. D. Hill, Bill McColl, Dan C. Stefanescu, Mark W. Goudreau, Kevin Lang, Satish B. Rao, Torsten Suel, Thanasis Tsantilas, and Rob H. Bisseling. 1998. BSPlib: The BSP programming library. Parallel Computing 24, 14 (Dec. 1998), 1947–1980.
[15]
Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun. 2011. Efficient parallel graph exploration on multi-core CPU and GPU. In 2011 International Conference on Parallel Architectures and Compilation Techniques. 78–88.
[17]
Shunning Jiang, Peitian Pan, Yanghui Ou, and Christopher Batten. 2020. PyMTL3: A python framework for open-source hardware modeling, generation, simulation, and verification. IEEE Micro 40, 4 (July 2020), 58–66.
[18]
Xiping Jiang, Fengguo Zuo, Song Wang, Xiaofeng Zhou, Yubing Wang, Qi Liu, Qiwei Ren, and Ming Liu. 2022. A 1596-GB/s 48-Gb stacked embedded DRAM 384-Core SoC with hybrid bonding integration. IEEE Solid-state Circuits Letters 5 (2022), 110–113.
[19]
George Karypis and Vipin Kumar. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing 20, 1 (Jan. 1998), 359–392.
[20]
Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web (WWW’10). Association for Computing Machinery, New York, NY, USA, 591–600.
[21]
Aapo Kyrola, Guy Blelloch, and Carlos Guestrin. 2012. GraphChi: Large-scale graph computation on just a PC. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI’12). USENIX Association, 31–46.
[22]
Kartik Lakhotia, Rajgopal Kannan, Sourav Pati, and Viktor Prasanna. 2020. GPOP: A scalable cache- and memory-efficient framework for graph processing over parts. ACM Transactions on Parallel Computing 7, 1 (March 2020), 7:1–7:24.
[23]
Jure Leskovec, Lada A. Adamic, and Bernardo A. Huberman. 2007. The dynamics of viral marketing. ACM Transactions on the Web 1, 1 (May 2007), 5–es.
[24]
Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, and Michael W. Mahoney. 2009. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics 6, 1 (Jan. 2009), 29–123.
[25]
Shuchuan Lo and Chingching Lin. 2006. WMR—A graph-based algorithm for friend recommendation. In Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI’06). IEEE Computer Society, 121–128.
[26]
Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 5, 8 (April 2012), 716–727.
[27]
Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD’10). Association for Computing Machinery, New York, NY, USA, 135–146.
[28]
Julian McAuley and Jure Leskovec. 2012. Learning to discover social circles in ego networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 (NIPS’12). Curran Associates Inc., Red Hook, NY, USA, 539–547.
[29]
Newton, Virendra Singh, and Trevor E. Carlson. 2020. PIM-GraphSCC: PIM-based graph processing using graph’s community structures. IEEE Computer Architecture Letters 19, 2 (July 2020), 151–154.
[30]
Fabio Petroni, Leonardo Querzoni, Khuzaima Daudjee, Shahin Kamali, and Giorgio Iacoboni. 2015. HDRF: Stream-based partitioning for power-law graphs. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (CIKM’15). Association for Computing Machinery, New York, NY, USA, 243–252.
[31]
Amitabha Roy, Ivo Mihailovic, and Willy Zwaenepoel. 2013. X-Stream: Edge-centric graph processing using streaming partitions. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP’13). Association for Computing Machinery, New York, NY, USA, 472–488.
[32]
Kihwan Seong, Donguk Park, Gyeomje Bae, Hyunwoo Lee, Youngseob Suh, Wooseuk Oh, Hyemun Lee, Juyoung Kim, Takgun Lee, Geonhoo Mo, Sukhyun Jung, Dongcheol Choi, Byoung-Joo Yoo, Sanghune Park, Hyo-Gyuem Rhew, and Jongshin Shin. 2023. A 4nm 32Gb/s 8Tb/s/Mm Die-to-Die chiplet using NRZ single-ended transceiver with equalization schemes and training techniques. In 2023 IEEE International Solid- State Circuits Conference (ISSCC’23). IEEE, San Francisco, CA, USA, 114–116.
[33]
Isabelle Stanton and Gabriel Kliot. 2012. Streaming graph partitioning for large distributed graphs. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’12). Association for Computing Machinery, New York, NY, USA, 1222–1230.
[34]
Narayanan Sundaram, Nadathur Satish, Md Mostofa Ali Patwary, Subramanya R. Dulloor, Michael J. Anderson, Satya Gautam Vadlamudi, Dipankar Das, and Pradeep Dubey. 2015. GraphMat: High performance graph analytics made productive. Proceedings of the VLDB Endowment 8, 11 (July 2015), 1214–1225.
[35]
Lei Tang and Huan Liu. 2010. Graph mining applications to social network analysis. In Managing and Mining Graph Data, Charu C. Aggarwal and Haixun Wang (Eds.). Springer US, Boston, MA, 487–513.
[36]
Charalampos Tsourakakis, Christos Gkantsidis, Bozidar Radunovic, and Milan Vojnovic. 2014. FENNEL: Streaming graph partitioning for massive scale graphs. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining (WSDM’14). Association for Computing Machinery, New York, NY, USA, 333–342.
[37]
Junpeng Wang, Haitao Du, Bo Ding, Qi Xu, Song Chen, and Yi Kang. 2023. DDAM: Data distribution-aware mapping of CNNs on processing-in-memory systems. ACM Transactions on Design Automation of Electronic Systems 28, 3 (March 2023), 36:1–36:30.
[38]
Tianyi Wang, Yang Chen, Zengbin Zhang, Tianyin Xu, Long Jin, Pan Hui, Beixing Deng, and Xing Li. 2011. Understanding graph sampling algorithms for social network analysis. In Proceedings of the 2011 31st International Conference on Distributed Computing Systems Workshops (ICDCSW’11). IEEE Computer Society, 123–128.
[39]
Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, and John D. Owens. 2016. Gunrock: A high-performance graph processing library on the GPU. ACM SIGPLAN Notices 51, 8 (Feb. 2016), 11:1–11:12.
[40]
Joyce Jiyoung Whang, Andrew Lenharth, Inderjit S. Dhillon, and Keshav Pingali. 2015. Scalable data-driven PageRank: Algorithms, system issues, and lessons learned. In Euro-Par 2015: Parallel Processing (Lecture Notes in Computer Science), Jesper Larsson Träff, Sascha Hunold, and Francesco Versaci (Eds.). Springer, Berlin, 438–450.
[41]
Cong Xie, Ling Yan, Wu-Jun Li, and Zhihua Zhang. 2014. Distributed power-law graph computing: Theoretical and empirical analysis. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1 (NIPS’14). MIT Press, Cambridge, MA, USA, 1673–1681.
[42]
Zihui Xue, Yuedong Yang, and Radu Marculescu. 2023. SUGAR: Efficient subgraph-level training via resource-aware graph partitioning. IEEE Transactions on Computers 72, 11 (Nov. 2023), 3167–3177.
[43]
Mingyu Yan, Xing Hu, Shuangchen Li, Abanti Basak, Han Li, Xin Ma, Itir Akgun, Yujing Feng, Peng Gu, Lei Deng, Xiaochun Ye, Zhimin Zhang, Dongrui Fan, and Yuan Xie. 2019. Alleviating irregularity in graph analytics acceleration: A hardware/software co-design approach. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’52). Association for Computing Machinery, New York, NY, USA, 615–628.
[44]
Pengcheng Yao, Long Zheng, Yu Huang, Qinggang Wang, Chuangyi Gui, Zhen Zeng, Xiaofei Liao, Hai Jin, and Jingling Xue. 2022. ScalaGraph: A scalable accelerator for massively parallel graph processing. In 2022 IEEE International Symposium on High-performance Computer Architecture (HPCA’22). 199–212.
[45]
Mingxing Zhang, Youwei Zhuo, Chao Wang, Mingyu Gao, Yongwei Wu, Kang Chen, Christos Kozyrakis, and Xuehai Qian. 2018. GraphP: Reducing communication for PIM-based graph processing with efficient data partition. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA’18). 544–557.
[46]
Minxuan Zhou, Muzhou Li, Mohsen Imani, and Tajana Rosing. 2021. HyGraph: Accelerating graph processing with hybrid memory-centric computing. In 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE’21). 330–335.
[47]
Xiaowei Zhu, Wenguang Chen, Weimin Zheng, and Xiaosong Ma. 2016. Gemini: A computation-centric distributed graph processing system. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI’16). USENIX Association, 301–316.
[48]
Xiaowei Zhu, Wentao Han, and Wenguang Chen. 2015. GridGraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning. In 2015 USENIX Annual Technical Conference (USENIX ATC’15). 375–386.
[49]
Youwei Zhuo, Chao Wang, Mingxing Zhang, Rui Wang, Dimin Niu, Yanzhi Wang, and Xuehai Qian. 2019. GraphQ: Scalable PIM-based graph processing. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’52). Association for Computing Machinery, New York, NY, USA, 712–725.

Cited By

View all
  • (2024)A Comprehensive Review of Processing-in-Memory Architectures for Deep Neural NetworksComputers10.3390/computers1307017413:7(174)Online publication date: 16-Jul-2024
  • (2024)ROI-HIT: Region of Interest-Driven High-Dimensional Microarchitecture Design Space ExplorationIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2024.344300643:11(4178-4189)Online publication date: 1-Nov-2024

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Design Automation of Electronic Systems
ACM Transactions on Design Automation of Electronic Systems  Volume 29, Issue 4
July 2024
360 pages
EISSN:1557-7309
DOI:10.1145/3613660
  • Editor:
  • Jiang Hu
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

Publication History

Published: 21 June 2024
Online AM: 18 April 2024
Accepted: 08 April 2024
Revised: 17 March 2024
Received: 27 October 2023
Published in TODAES Volume 29, Issue 4

Check for updates

Author Tags

  1. Graph analytics
  2. processing-in-memory
  3. on-chip network
  4. load balancing

Qualifiers

  • Research-article

Funding Sources

  • Strategic Priority Research Program of Chinese Academy of Sciences

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)406
  • Downloads (Last 6 weeks)76
Reflects downloads up to 15 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)A Comprehensive Review of Processing-in-Memory Architectures for Deep Neural NetworksComputers10.3390/computers1307017413:7(174)Online publication date: 16-Jul-2024
  • (2024)ROI-HIT: Region of Interest-Driven High-Dimensional Microarchitecture Design Space ExplorationIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2024.344300643:11(4178-4189)Online publication date: 1-Nov-2024

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Full Text

View this article in Full Text.

Full Text

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media