More Web Proxy on the site http://driver.im/

research-article

GraphReduce: processing large-scale graphs on accelerator-based systems

Authors:

Dipanjan Sengupta,

Shuaiwen Leon Song,

Karsten SchwanAuthors Info & Claims

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 28, Pages 1 - 12

https://doi.org/10.1145/2807591.2807655

Published: 15 November 2015 Publication History

Abstract

Recent work on real-world graph analytics has sought to leverage the massive amount of parallelism offered by GPU devices, but challenges remain due to the inherent irregularity of graph algorithms and limitations in GPU-resident memory for storing large graphs. We present GraphReduce, a highly efficient and scalable GPU-based framework that operates on graphs that exceed the device's internal memory capacity. GraphReduce adopts a combination of edge- and vertex-centric implementations of the Gather-Apply-Scatter programming model and operates on multiple asynchronous GPU streams to fully exploit the high degrees of parallelism in GPUs with efficient graph data movement between the host and device. GraphReduce-based programming is performed via device functions that include gatherMap, gatherReduce, apply, and scatter, implemented by programmers for the graph algorithms they wish to realize. Extensive experimental evaluations for a wide variety of graph inputs and algorithms demonstrate that GraphReduce significantly outperforms other competing out-of-memory approaches.

References

[1]

Cage15: www.cise.ufl.edu/research/sparse/matrices/vanheukelum.

[2]

CUDA 7.0: https://developer.nvidia.com/cuda-downloads/.

[3]

DIMACS10 ak2010: www.cise.ufl.edu/research/sparse/matrices/dimacs10.

[4]

DIMACS10 belgium_osm: www.cise.ufl.edu/research/sparse/matrices/dimacs10.

[5]

DIMACS10 coAuthorsDBLP: www.cc.gatech.edu/dimacs10/data/coauthor/.

[6]

Modern GPU library:http://nvlabs.github.io/moderngpu/.

[7]

NVIDIA Kepler GK110 white paper. 2012.

[8]

uk-2002: http://law.di.unimi.it/webdata/uk-2002/.

[9]

Unified Virtual Addressing: http://devblogs.nvidia.com/parallelforall/unified-memoryin-cuda-6/.

[10]

VertexAPI2: http://www.royal-caliber.com/main.html. 2015.

[11]

Yahoo WebScope: G7 webscope.sandbox.yahoo.com/catalog.php?datatype=g.

[12]

N. Bell and M. Garland. Efficient sparse matrix-vector multiplication on CUDA. NVIDIA Technical Report NVR-2008-004, NVIDIA Corporation, Dec. 2008.

[13]

H. Djidjev, S. Thulasidasan, G. Chapuis, R. Andonov, and D. Lavenier. Efficient multi-gpu computation of all-pairs shortest paths. In Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, pages 360--369, May 2014.

Digital Library

[14]

N. T. Duong, Q. A. P. Nguyen, A. T. Nguyen, and H.-D. Nguyen. Parallel pagerank computation using gpus. In Proceedings of the Third Symposium on Information and Communication Technology, SoICT '12, pages 223--230, New York, NY, USA, 2012. ACM.

Digital Library

[15]

N. Farooqui, C. J. Rossbach, Y. Yu, and K. Schwan. Leo: A profile-driven dynamic optimization framework for gpu applications. In 2014 Conference on Timely Results in Operating Systems (TRIOS 14), Broomfield, CO, Oct. 2014. USENIX Association.

Digital Library

[16]

N. Farooqui, K. Schwan, and S. Yalamanchili. Efficient instrumentation of gpgpu applications using information flow analysis and symbolic execution. In Proceedings of Workshop on General Purpose Processing Using GPUs, GPGPU-7, pages 19:19--19:27, New York, NY, USA, 2014. ACM.

[17]

Z. Fu, M. Personick, and B. Thompson. Mapgraph: A high level api for fast development of high performance graph analytics on gpus. In Proceedings of Workshop on GRAph Data Management Experiences and Systems, GRADES'14, pages 2:1--2:6, New York, NY, USA, 2014. ACM.

Digital Library

[18]

A. Gharaibeh, L. Beltrão Costa, E. Santos-Neto, and M. Ripeanu. A yoke of oxen and a thousand chickens for heavy lifting graph processing. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, pages 345--354, New York, NY, USA, 2012. ACM.

Digital Library

[19]

J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. Powergraph: Distributed graph-parallel computation on natural graphs. In Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), pages 17--30, Hollywood, CA, 2012. USENIX.

Digital Library

[20]

S. Hong, H. Chafi, E. Sedlar, and K. Olukotun. Green-marl: A dsl for easy and efficient graph analysis. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, pages 349--362, New York, NY, USA, 2012. ACM.

Digital Library

[21]

S. Hong, T. Oguntebi, and K. Olukotun. Efficient parallel graph exploration on multi-core cpu and gpu. In Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on, pages 78--88, Oct 2011.

Digital Library

[22]

F. Khorasani, K. Vora, R. Gupta, and L. N. Bhuyan. Cusha: Vertex-centric graph processing on gpus. In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, HPDC '14, pages 239--252, New York, NY, USA, 2014. ACM.

Digital Library

[23]

A. Kyrola, G. Blelloch, and C. Guestrin. Graphchi: Large-scale graph computation on just a pc. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI'12, pages 31--46, Berkeley, CA, USA, 2012. USENIX Association.

Digital Library

[24]

J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters, 2008.

[25]

C. Li, S. L. Song, H. Dai, A. Sidelnik, S. K. S. Hari, and H. Zhou. Locality-driven dynamic gpu cache bypassing. In Proceedings of the 29th ACM on International Conference on Supercomputing, ICS '15, pages 67--77, New York, NY, USA, 2015. ACM.

Digital Library

[26]

Y. Low, J. E. Gonzalez, A. Kyrola, D. Bickson, C. E. Guestrin, and J. Hellerstein. Graphlab: A new framework for parallel machine learning. arXiv preprint arXiv:1408.2041, 2014.

Digital Library

[27]

A. Lumsdaine, D. Gregor, B. Hendrickson, and J. Berry. Challenges in parallel graph processing. Parallel Processing Letters, 17(01):5--20, 2007.

[28]

G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, pages 135--146, New York, NY, USA, 2010. ACM.

Digital Library

[29]

D. Merrill, M. Garland, and A. Grimshaw. Scalable gpu graph traversal. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '12, pages 117--128, New York, NY, USA, 2012. ACM.

Digital Library

[30]

L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. 1999.

[31]

R. A. Rossi and N. K. Ahmed. The network data repository with interactive graph analytics and visualization. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[32]

A. Roy, I. Mihailovic, and W. Zwaenepoel. X-stream: Edge-centric graph processing using streaming partitions. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 472--488, New York, NY, USA, 2013. ACM.

Digital Library

[33]

O. Schenk, A. WÃd'chter, and M. Weiser. Inertia-revealing preconditioning for large-scale nonconvex constrained optimization. SIAM Journal on Scientific Computing, 31(2):939--960, 2009.

Digital Library

[34]

D. Sengupta, R. Belapure, and K. Schwan. Multi-tenancy on gpgpu-based servers. In Proceedings of the 7th International Workshop on Virtualization Technologies in Distributed Computing, VTDC '13, pages 3--10, New York, NY, USA, 2013. ACM.

Digital Library

[35]

D. Sengupta, A. Goswami, K. Schwan, and K. Pallavi. Scheduling multi-tenant cloud workloads on accelerator-based systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '14, pages 513--524, Piscataway, NJ, USA, 2014. IEEE Press.

Digital Library

[36]

D. Sengupta, A. Kapil, S. Song, and K. Schwan. Graphreduce: Large-scale graph analytics on accelerator-based hpc systems. In Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2015 IEEE 28th International, May 2015.

Digital Library

[37]

S. Song and K. Cameron. System-level power-performance efficiency modeling for emergent gpu architectures. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, pages 473--474, New York, NY, USA, 2012. ACM.

Digital Library

[38]

S. Song, C. Su, B. Rountree, and K. Cameron. A simplified and accurate model of power-performance efficiency on emergent gpu architectures. In Parallel Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, pages 673--686, May 2013.

Digital Library

[39]

L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103--111, Aug. 1990.

Digital Library

[40]

S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC '07, pages 38:1--38:12, New York, NY, USA, 2007. ACM.

Digital Library

[41]

J. Yang and J. Leskovec. Defining and evaluating network communities based on ground-truth. In Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics, MDS '12, pages 3:1--3:8, New York, NY, USA, 2012. ACM.

Digital Library

[42]

A. Yoo, E. Chow, K. Henderson, W. McLendon, B. Hendrickson, and U. Catalyurek. A scalable distributed parallel breadth-first search algorithm on bluegene/l. In Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, SC '05, pages 25--, Washington, DC, USA, 2005. IEEE Computer Society.

Digital Library

[43]

J. Zhong and B. He. Medusa: A parallel graph processing system on graphics processors. 2013.

Cited By

Park SOh SKim MLee IChabbi MSteuwer M(2024)INFINEL: An efficient GPU-based processing method for unpredictable large output graph queriesProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638490(147-159)Online publication date: 2-Mar-2024
https://dl.acm.org/doi/10.1145/3627535.3638490
Mandulak MGhosh SFerdous SHalappanvar MSlota G(2024)Efficient Weighted Graph Matching on GPUsProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00024(1-16)Online publication date: 17-Nov-2024
https://dl.acm.org/doi/10.1109/SC41406.2024.00024
Cheng QZheng ZJiang TTang CWang TGong LWang CZhou X(2024)SoGraph: A State-Aware Architecture for Out-of-Memory Graph Processing on HBM-Equipped FPGAs2024 34th International Conference on Field-Programmable Logic and Applications (FPL)10.1109/FPL64840.2024.00021(87-91)Online publication date: 2-Sep-2024
https://doi.org/10.1109/FPL64840.2024.00021
Show More Cited By

Index Terms

GraphReduce: processing large-scale graphs on accelerator-based systems

Recommendations

GraphReduce: Large-Scale Graph Analytics on Accelerator-Based HPC Systems
IPDPSW '15: Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop

Recent work on graph analytics has sought to leverage the high performance offered by GPU devices, but challenges remain due to the inherent irregularity of graph algorithm and limitations in GPU-resident memory for storing large graphs. The Graph ...
On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing
SAAHPC '11: Proceedings of the 2011 Symposium on Application Accelerators in High-Performance Computing

The graphics processing unit (GPU) has made significant strides as an accelerator in parallel computing. However, because the GPU has resided out on PCIe as a discrete device, the performance of GPU applications can be bottlenecked by data transfers ...
Vectorising k-Core Decomposition for GPU Acceleration
SSDBM '20: Proceedings of the 32nd International Conference on Scientific and Statistical Database Management

k-Core decomposition is a well-studied community detection problem in graph analytics in which each k-core of vertices induces a subgraph where all vertices have degree at least k. The decomposition is expensive to compute on large graphs and efforts to ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2015

985 pages

ISBN:9781450337236

DOI:10.1145/2807591

General Chair:
Jackie Kern
University of Illinois at Urbana-Champaign, Urbana, Illinois
,
Program Chair:
Jeffrey S. Vetter
Oak Ridge National Laboratory and Georgia Institute of Technology, Oak Ridge, Tennessee

Copyright © 2015 ACM.

© 2015 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 November 2015

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

Intel Science and Technology Center in Cloud Computing

Conference

SC15

Sponsor:

SIGHPC
SIGARCH
IEEE-CS

SC15: The International Conference for High Performance Computing, Networking, Storage and Analysis

November 15 - 20, 2015

Texas, Austin

Acceptance Rates

SC '15 Paper Acceptance Rate 79 of 358 submissions, 22%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Upcoming Conference

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

65
Total Citations
View Citations
860
Total Downloads

Downloads (Last 12 months)43
Downloads (Last 6 weeks)6

Reflects downloads up to 10 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Park SOh SKim MLee IChabbi MSteuwer M(2024)INFINEL: An efficient GPU-based processing method for unpredictable large output graph queriesProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638490(147-159)Online publication date: 2-Mar-2024
https://dl.acm.org/doi/10.1145/3627535.3638490
Mandulak MGhosh SFerdous SHalappanvar MSlota G(2024)Efficient Weighted Graph Matching on GPUsProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00024(1-16)Online publication date: 17-Nov-2024
https://dl.acm.org/doi/10.1109/SC41406.2024.00024
Cheng QZheng ZJiang TTang CWang TGong LWang CZhou X(2024)SoGraph: A State-Aware Architecture for Out-of-Memory Graph Processing on HBM-Equipped FPGAs2024 34th International Conference on Field-Programmable Logic and Applications (FPL)10.1109/FPL64840.2024.00021(87-91)Online publication date: 2-Sep-2024
https://doi.org/10.1109/FPL64840.2024.00021
Cheng QWang TWang C(2024)Ph.D. Project: Optimizing the Data Traffic for Large Graph Processing on FPGA via a Stateful Approach2024 IEEE 32nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)10.1109/FCCM60383.2024.00047(1-2)Online publication date: 5-May-2024
https://doi.org/10.1109/FCCM60383.2024.00047
Wang QChen YWong WHe B(2023)HongTu: Scalable Full-Graph GNN Training on Multiple GPUsProceedings of the ACM on Management of Data10.1145/36267331:4(1-27)Online publication date: 12-Dec-2023
https://dl.acm.org/doi/10.1145/3626733
Sano SBando YHiwada KKajihara HSuzuki TNakanishi YTaki DKaneko AShiozawa T(2023)GPU Graph Processing on CXL-Based Microsecond-Latency External MemoryProceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis10.1145/3624062.3624173(962-972)Online publication date: 12-Nov-2023
https://dl.acm.org/doi/10.1145/3624062.3624173
Xia YJiang PAgrawal GRamnath RDehnavi MKulkarni MKrishnamoorthy S(2023)End-to-End LU Factorization of Large Matrices on GPUsProceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3572848.3577486(288-300)Online publication date: 25-Feb-2023
https://dl.acm.org/doi/10.1145/3572848.3577486
Li STang RZhu JZhao ZGong XWang WZhang JYew P(2023)Liberator: A Data Reuse Framework for Out-of-Memory Graph Computing on GPUsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2023.326866234:6(1954-1967)Online publication date: 1-Jun-2023
https://dl.acm.org/doi/10.1109/TPDS.2023.3268662
Zhang LLu KLai ZFu YTang YLi D(2023)Accelerating GNN Training by Adapting Large Graphs to Distributed Heterogeneous ArchitecturesIEEE Transactions on Computers10.1109/TC.2023.330507772:12(3473-3488)Online publication date: Dec-2023
https://doi.org/10.1109/TC.2023.3305077
Xing YLi YWang ZXu YLui J(2023)LightTraffic: On Optimizing CPU-GPU Data Traffic for Efficient Large-scale Random Walks2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00073(882-895)Online publication date: Apr-2023
https://doi.org/10.1109/ICDE55515.2023.00073
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents