[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2807591.2807655acmconferencesArticle/Chapter ViewAbstractPublication PagesscConference Proceedingsconference-collections
research-article

GraphReduce: processing large-scale graphs on accelerator-based systems

Published: 15 November 2015 Publication History

Abstract

Recent work on real-world graph analytics has sought to leverage the massive amount of parallelism offered by GPU devices, but challenges remain due to the inherent irregularity of graph algorithms and limitations in GPU-resident memory for storing large graphs. We present GraphReduce, a highly efficient and scalable GPU-based framework that operates on graphs that exceed the device's internal memory capacity. GraphReduce adopts a combination of edge- and vertex-centric implementations of the Gather-Apply-Scatter programming model and operates on multiple asynchronous GPU streams to fully exploit the high degrees of parallelism in GPUs with efficient graph data movement between the host and device. GraphReduce-based programming is performed via device functions that include gatherMap, gatherReduce, apply, and scatter, implemented by programmers for the graph algorithms they wish to realize. Extensive experimental evaluations for a wide variety of graph inputs and algorithms demonstrate that GraphReduce significantly outperforms other competing out-of-memory approaches.

References

[1]
Cage15: www.cise.ufl.edu/research/sparse/matrices/vanheukelum.
[2]
CUDA 7.0: https://developer.nvidia.com/cuda-downloads/.
[3]
DIMACS10 ak2010: www.cise.ufl.edu/research/sparse/matrices/dimacs10.
[4]
DIMACS10 belgium_osm: www.cise.ufl.edu/research/sparse/matrices/dimacs10.
[5]
DIMACS10 coAuthorsDBLP: www.cc.gatech.edu/dimacs10/data/coauthor/.
[6]
Modern GPU library:http://nvlabs.github.io/moderngpu/.
[7]
NVIDIA Kepler GK110 white paper. 2012.
[8]
uk-2002: http://law.di.unimi.it/webdata/uk-2002/.
[9]
Unified Virtual Addressing: http://devblogs.nvidia.com/parallelforall/unified-memoryin-cuda-6/.
[10]
VertexAPI2: http://www.royal-caliber.com/main.html. 2015.
[11]
Yahoo WebScope: G7 webscope.sandbox.yahoo.com/catalog.php?datatype=g.
[12]
N. Bell and M. Garland. Efficient sparse matrix-vector multiplication on CUDA. NVIDIA Technical Report NVR-2008-004, NVIDIA Corporation, Dec. 2008.
[13]
H. Djidjev, S. Thulasidasan, G. Chapuis, R. Andonov, and D. Lavenier. Efficient multi-gpu computation of all-pairs shortest paths. In Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, pages 360--369, May 2014.
[14]
N. T. Duong, Q. A. P. Nguyen, A. T. Nguyen, and H.-D. Nguyen. Parallel pagerank computation using gpus. In Proceedings of the Third Symposium on Information and Communication Technology, SoICT '12, pages 223--230, New York, NY, USA, 2012. ACM.
[15]
N. Farooqui, C. J. Rossbach, Y. Yu, and K. Schwan. Leo: A profile-driven dynamic optimization framework for gpu applications. In 2014 Conference on Timely Results in Operating Systems (TRIOS 14), Broomfield, CO, Oct. 2014. USENIX Association.
[16]
N. Farooqui, K. Schwan, and S. Yalamanchili. Efficient instrumentation of gpgpu applications using information flow analysis and symbolic execution. In Proceedings of Workshop on General Purpose Processing Using GPUs, GPGPU-7, pages 19:19--19:27, New York, NY, USA, 2014. ACM.
[17]
Z. Fu, M. Personick, and B. Thompson. Mapgraph: A high level api for fast development of high performance graph analytics on gpus. In Proceedings of Workshop on GRAph Data Management Experiences and Systems, GRADES'14, pages 2:1--2:6, New York, NY, USA, 2014. ACM.
[18]
A. Gharaibeh, L. Beltrão Costa, E. Santos-Neto, and M. Ripeanu. A yoke of oxen and a thousand chickens for heavy lifting graph processing. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, pages 345--354, New York, NY, USA, 2012. ACM.
[19]
J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. Powergraph: Distributed graph-parallel computation on natural graphs. In Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), pages 17--30, Hollywood, CA, 2012. USENIX.
[20]
S. Hong, H. Chafi, E. Sedlar, and K. Olukotun. Green-marl: A dsl for easy and efficient graph analysis. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, pages 349--362, New York, NY, USA, 2012. ACM.
[21]
S. Hong, T. Oguntebi, and K. Olukotun. Efficient parallel graph exploration on multi-core cpu and gpu. In Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on, pages 78--88, Oct 2011.
[22]
F. Khorasani, K. Vora, R. Gupta, and L. N. Bhuyan. Cusha: Vertex-centric graph processing on gpus. In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, HPDC '14, pages 239--252, New York, NY, USA, 2014. ACM.
[23]
A. Kyrola, G. Blelloch, and C. Guestrin. Graphchi: Large-scale graph computation on just a pc. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI'12, pages 31--46, Berkeley, CA, USA, 2012. USENIX Association.
[24]
J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters, 2008.
[25]
C. Li, S. L. Song, H. Dai, A. Sidelnik, S. K. S. Hari, and H. Zhou. Locality-driven dynamic gpu cache bypassing. In Proceedings of the 29th ACM on International Conference on Supercomputing, ICS '15, pages 67--77, New York, NY, USA, 2015. ACM.
[26]
Y. Low, J. E. Gonzalez, A. Kyrola, D. Bickson, C. E. Guestrin, and J. Hellerstein. Graphlab: A new framework for parallel machine learning. arXiv preprint arXiv:1408.2041, 2014.
[27]
A. Lumsdaine, D. Gregor, B. Hendrickson, and J. Berry. Challenges in parallel graph processing. Parallel Processing Letters, 17(01):5--20, 2007.
[28]
G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, pages 135--146, New York, NY, USA, 2010. ACM.
[29]
D. Merrill, M. Garland, and A. Grimshaw. Scalable gpu graph traversal. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '12, pages 117--128, New York, NY, USA, 2012. ACM.
[30]
L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. 1999.
[31]
R. A. Rossi and N. K. Ahmed. The network data repository with interactive graph analytics and visualization. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[32]
A. Roy, I. Mihailovic, and W. Zwaenepoel. X-stream: Edge-centric graph processing using streaming partitions. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 472--488, New York, NY, USA, 2013. ACM.
[33]
O. Schenk, A. WÃd'chter, and M. Weiser. Inertia-revealing preconditioning for large-scale nonconvex constrained optimization. SIAM Journal on Scientific Computing, 31(2):939--960, 2009.
[34]
D. Sengupta, R. Belapure, and K. Schwan. Multi-tenancy on gpgpu-based servers. In Proceedings of the 7th International Workshop on Virtualization Technologies in Distributed Computing, VTDC '13, pages 3--10, New York, NY, USA, 2013. ACM.
[35]
D. Sengupta, A. Goswami, K. Schwan, and K. Pallavi. Scheduling multi-tenant cloud workloads on accelerator-based systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '14, pages 513--524, Piscataway, NJ, USA, 2014. IEEE Press.
[36]
D. Sengupta, A. Kapil, S. Song, and K. Schwan. Graphreduce: Large-scale graph analytics on accelerator-based hpc systems. In Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2015 IEEE 28th International, May 2015.
[37]
S. Song and K. Cameron. System-level power-performance efficiency modeling for emergent gpu architectures. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, pages 473--474, New York, NY, USA, 2012. ACM.
[38]
S. Song, C. Su, B. Rountree, and K. Cameron. A simplified and accurate model of power-performance efficiency on emergent gpu architectures. In Parallel Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, pages 673--686, May 2013.
[39]
L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103--111, Aug. 1990.
[40]
S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC '07, pages 38:1--38:12, New York, NY, USA, 2007. ACM.
[41]
J. Yang and J. Leskovec. Defining and evaluating network communities based on ground-truth. In Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics, MDS '12, pages 3:1--3:8, New York, NY, USA, 2012. ACM.
[42]
A. Yoo, E. Chow, K. Henderson, W. McLendon, B. Hendrickson, and U. Catalyurek. A scalable distributed parallel breadth-first search algorithm on bluegene/l. In Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, SC '05, pages 25--, Washington, DC, USA, 2005. IEEE Computer Society.
[43]
J. Zhong and B. He. Medusa: A parallel graph processing system on graphics processors. 2013.

Cited By

View all
  • (2024)INFINEL: An efficient GPU-based processing method for unpredictable large output graph queriesProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638490(147-159)Online publication date: 2-Mar-2024
  • (2024)Efficient Weighted Graph Matching on GPUsProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00024(1-16)Online publication date: 17-Nov-2024
  • (2024)SoGraph: A State-Aware Architecture for Out-of-Memory Graph Processing on HBM-Equipped FPGAs2024 34th International Conference on Field-Programmable Logic and Applications (FPL)10.1109/FPL64840.2024.00021(87-91)Online publication date: 2-Sep-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2015
985 pages
ISBN:9781450337236
DOI:10.1145/2807591
  • General Chair:
  • Jackie Kern,
  • Program Chair:
  • Jeffrey S. Vetter
© 2015 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 November 2015

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. GPGPU
  2. big data
  3. data movement optimization
  4. graph analytics
  5. performance optimization

Qualifiers

  • Research-article

Funding Sources

  • Intel Science and Technology Center in Cloud Computing

Conference

SC15
Sponsor:

Acceptance Rates

SC '15 Paper Acceptance Rate 79 of 358 submissions, 22%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Upcoming Conference

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)43
  • Downloads (Last 6 weeks)6
Reflects downloads up to 10 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)INFINEL: An efficient GPU-based processing method for unpredictable large output graph queriesProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3627535.3638490(147-159)Online publication date: 2-Mar-2024
  • (2024)Efficient Weighted Graph Matching on GPUsProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00024(1-16)Online publication date: 17-Nov-2024
  • (2024)SoGraph: A State-Aware Architecture for Out-of-Memory Graph Processing on HBM-Equipped FPGAs2024 34th International Conference on Field-Programmable Logic and Applications (FPL)10.1109/FPL64840.2024.00021(87-91)Online publication date: 2-Sep-2024
  • (2024)Ph.D. Project: Optimizing the Data Traffic for Large Graph Processing on FPGA via a Stateful Approach2024 IEEE 32nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)10.1109/FCCM60383.2024.00047(1-2)Online publication date: 5-May-2024
  • (2023)HongTu: Scalable Full-Graph GNN Training on Multiple GPUsProceedings of the ACM on Management of Data10.1145/36267331:4(1-27)Online publication date: 12-Dec-2023
  • (2023)GPU Graph Processing on CXL-Based Microsecond-Latency External MemoryProceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis10.1145/3624062.3624173(962-972)Online publication date: 12-Nov-2023
  • (2023)End-to-End LU Factorization of Large Matrices on GPUsProceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3572848.3577486(288-300)Online publication date: 25-Feb-2023
  • (2023)Liberator: A Data Reuse Framework for Out-of-Memory Graph Computing on GPUsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2023.326866234:6(1954-1967)Online publication date: 1-Jun-2023
  • (2023)Accelerating GNN Training by Adapting Large Graphs to Distributed Heterogeneous ArchitecturesIEEE Transactions on Computers10.1109/TC.2023.330507772:12(3473-3488)Online publication date: Dec-2023
  • (2023)LightTraffic: On Optimizing CPU-GPU Data Traffic for Efficient Large-scale Random Walks2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00073(882-895)Online publication date: Apr-2023
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media