CGAcc: A Compressed Sparse Row Representation-Based BFS Graph Traversal Accelerator on Hybrid Memory Cube
Figure 1. Overview and structure of the Hybrid Memory Cube (HMC). DRAM: Dynamic Random-Access Memory; TSV: Through-Silicon Via. (a) Overview of the HMC. (b) Structure of the HMC.
Figure 2. Compressed sparse row-based graph traversal workflow. The notation a−b means an edge from vertex a to vertex b. Part of the timeline for traversing this sample graph is compared between conventional memory and CGAcc in the following sections.
Figure 3. Speedup of graph traversal benchmarks without prefetch (N_Pref) and with stream prefetch (Pref).
Figure 4. Stall ratio of graph traversal benchmarks without prefetch (N_Pref) and with stream prefetch (Pref).
Figure 5. NStall ratio of graph traversal benchmarks without prefetch (N_Pref) and with stream prefetch (Pref).
Figure 6. L1 miss rate of graph traversal benchmarks without prefetch (N_Pref) and with stream prefetch (Pref).
Figure 7. Overview of CGAcc. EB: Edge Buffer; EP: Edge Prefetcher; PB: Prefetch Buffer; VB: Vertex Buffer; VEP: Vertex Prefetcher; VSB: Visited Buffer; VSP: Visited Prefetcher; EC: Edge prefetch Cache; VSC: Visited prefetch Cache.
Figure 8. Example timeline of steps for graph traversal. This figure compares part of the timeline for traversing the sample graph in Figure 2 using conventional memory and CGAcc.
Figure 9. BFS workflow from programmer to hardware.
Figure 10. Comparison of the performance between the baseline and CGAcc.
Figure 11. Comparison of the performance with/without cache optimization.
Figure 12. Comparison of the performance on a few benchmarks between the graph prefetcher and CGAcc.
Figure 13. Comparison of the performance for different on-chip cache capacities.
Figure 14. Comparison of the on-chip cache hit rate for different on-chip cache capacities.
Figure 15. Performance comparison for different on-chip cache associativities.
Figure 16. Comparison of the on-chip cache hit rate for different on-chip cache associativities.
Figure 17. Comparison of the performance between the baseline and CGAcc on edges.
Figure 18. Comparison of the performance between the baseline and CGAcc on vertices.
Figure 19. Comparison of the performance between the baseline, graph prefetcher and CGAcc on BFS-like applications. GC: Graph Coloring; SSSP: Single-Source Shortest Path.
Figure 20. Comparison of the performance between the baseline, graph prefetcher and CGAcc on sequential-iteration applications.
Figure 21. Entries' consumption for each buffer. Blue, red, yellow and green lines refer to VEB, EB, VSB and PB, respectively.
Abstract
1. Introduction
- We characterize the performance bottleneck of graph traversal, analyze the benefit of using 3D-stacked memory for traversal and motivate the design of CGAcc. In our approach, the memory system acts as an active partner rather than a passive or co-processing device, as in most previous works.
- We propose CGAcc, a CSR-based graph traversal accelerator for the HMC. This design is based on knowledge of the workflow and structure of graph traversal. CGAcc augments the HMC’s logic layer with prefetching, which operates in a pipeline to reduce transaction latency and data movement cost.
- We evaluate CGAcc under a variety of conditions to consider several design trade-offs. The experimental results demonstrate that CGAcc offers excellent performance improvement with modest hardware cost.
2. Background
2.1. HMC Overview
2.2. Graph Traversal with CSR
2.3. Conventional Prefetching Techniques
Algorithm 1. BFS graph traversal with CSR (pseudocode, lines 1–14): an initialization phase scans the vertices and pushes the start vertex onto the work queue with Vertex.push(Cur_vertex); the main while loop then pops each frontier vertex, iterates over its edge range and, if a neighbor has not been visited, marks and enqueues it.
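As a concrete illustration of the CSR-based BFS that the listing above summarizes, the following is a minimal software sketch (the array and function names are ours, not the paper's):

```python
from collections import deque

def bfs_csr(vertex, edge, start):
    """BFS over a CSR graph. vertex[i] is the offset into edge[] where
    vertex i's adjacency list begins, so vertex has n + 1 entries and
    edge[vertex[i]:vertex[i + 1]] holds i's neighbors."""
    n = len(vertex) - 1
    visited = [False] * n
    order = []                       # traversal order returned to the caller
    queue = deque([start])
    visited[start] = True
    while queue:
        v = queue.popleft()
        order.append(v)
        for u in edge[vertex[v]:vertex[v + 1]]:
            if not visited[u]:       # check the visited array
                visited[u] = True    # the only write to the visited array
                queue.append(u)
    return order

# Sample graph with edges 0-1, 0-2, 1-3, 2-3
vertex = [0, 2, 3, 4, 4]
edge = [1, 2, 3, 3]
print(bfs_csr(vertex, edge, 0))      # [0, 1, 2, 3]
```

Each loop iteration touches the three CSR arrays in sequence (vertex, then edge, then visited), which is the dependent-load chain that CGAcc pipelines in memory.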
2.4. Bottleneck in Graph Traversal
3. Architecture
3.1. CGAcc Structure
- (1)
- Register group: A collection of registers that maintains metadata and status information about a graph traversal. First, the Activation Register (AR) is used to enable CGAcc. When the CPU initiates a traversal, it sends an activation request that includes a start vertex index to CGAcc; this request is recorded in the AR. Second, the Continual Register (CR) stores the current maximum start vertex index. This register is needed because a graph may contain several unconnected subgraphs: when traversal of the current subgraph finishes, the address in the CR is used as the start vertex of the next traversal. Lastly, the End subgraph Register (ER) records the end of the currently processed subgraph. For on-line algorithms that need only a partial traversal, there is no essential difference for CGAcc between traversing a complete graph and a partial graph: CGAcc simply keeps fetching data from the three arrays on the memory side and sends out the traversal order at runtime. The CPU can stop the traversal by setting the ER if only a partial traversal is needed.
- (2)
- Prefetch group: The core of CGAcc. Because CSR represents a graph with three arrays, elements from these arrays can be prefetched by separate prefetchers. Thus, the prefetch group includes the Vertex Prefetcher (VEP), Edge Prefetcher (EP) and Visited Prefetcher (VSP). The VEP receives a new vertex index and uses it to access the visited array and fetch vertex data, according to the visited status. The VEP reads the AR to start, and it reads the CR to get the address of the next start vertex when notified that processing of the current subgraph has finished. In all other cases, the VEP receives requests containing the new vertex index from the VSP. When vertex data are fetched, the VEP sends requests to the EP to fetch edge data. After the edge data (expanded from the currently processed vertex) are fetched, the EP sends a request to the VSP to fetch visited data. The VSP receives this request from the EP and then determines whether the vertex is new by simply snooping for a write access. The only situation in which a write access is issued to the visited array is when a new vertex (i.e., one never visited before) is encountered; the value at its corresponding location in the visited array is then written as true. In this case, the vertex is sent to the VEP as an expanded vertex for the following traversal.
- (3)
- Internal cache: Used to reduce transaction latency. The cache is arranged as three small buffers: Vertex prefetch Cache (VEC), Edge prefetch Cache (EC) and Visited prefetch Cache (VSC). These buffers cache a portion of the vertex, edge and visited arrays. For a memory access by a particular prefetcher, the corresponding cache is accessed first. The data are directly fetched on a cache hit. Otherwise, the prefetcher associated with the array performs a memory access to the DRAM. The EC and VSC use Least Recently Used (LRU) replacement. The VEC uses an optimized replacement policy, which is described in Section 3.3. Although these prefetchers are independent, they share cache resources as part of CGAcc. These internal caches store data from different arrays (i.e., vertex, edge and visited arrays). At runtime, every prefetcher can access an arbitrary cache if necessary. For example, the VEP will not only access the data in the VEC, but also data in the VSC because the VEP will handle both the vertex and visited array.
- (4)
- FIFO (First-In, First-Out) buffer: In our design, FIFO buffers (i.e., Vertex Buffer (VEB), Edge Buffer (EB) and Visited Buffer (VSB)) are needed for each prefetcher. Each buffer has entries to hold address information. The value in each entry is evicted after it has been accessed. These buffers store data in a specific way. Each entry in the VEB stores one address. The VEP uses this address to issue two accesses. Each entry in the EB is used to store an address pair (Addrs, Addre). The EP uses this address pair to issue multiple accesses. Finally, each entry in the VSB stores one address, and the VSP uses this address to issue one access.
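The interplay of the three prefetchers and their FIFO buffers described above can be sketched in software as follows. This is an illustrative analogy, not the hardware implementation: the names veb, eb and vsb mirror the buffers, and the loop body stands in for one round of pipelined operation.

```python
from collections import deque

def traverse(vertex, edge, start):
    """Software analogy of CGAcc's VEP -> EP -> VSP pipeline over CSR.
    VEB entries hold one vertex index; EB entries hold an (addr_start,
    addr_end) pair into the edge array; VSB entries hold one index to check."""
    n = len(vertex) - 1
    visited = [False] * n
    veb, eb, vsb = deque([start]), deque(), deque()
    visited[start] = True
    order = []
    while veb or eb or vsb:
        if veb:                          # VEP: fetch vertex[v] and vertex[v+1]
            v = veb.popleft()
            order.append(v)
            eb.append((vertex[v], vertex[v + 1]))
        if eb:                           # EP: fetch the whole edge range
            lo, hi = eb.popleft()
            vsb.extend(edge[lo:hi])
        while vsb:                       # VSP: snoop the visited array
            u = vsb.popleft()
            if not visited[u]:
                visited[u] = True        # write access => new vertex
                veb.append(u)            # hand the expanded vertex to the VEP
    return order

vertex = [0, 2, 3, 4, 4]                 # sample graph: 0-1, 0-2, 1-3, 2-3
edge = [1, 2, 3, 3]
print(traverse(vertex, edge, 0))         # [0, 1, 2, 3]
```

Because each stage only pops what an earlier stage pushed, the three prefetchers can run concurrently on different vertices in hardware, which is where the pipeline overlap comes from.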
3.2. CGAcc Operation
3.3. Optimization of On-Chip Cache
3.4. Generalized CGAcc
4. Experimental Setting
4.1. System Configuration
4.2. Workloads
5. Evaluation
5.1. Performance
5.2. Effect on On-Chip Cache
5.3. Effect on Graph Density
5.4. Generalized CGAcc
5.5. CGAcc Prefetch Buffer
5.6. Hardware Overhead
6. Graph Processing- and HMC-Related Work
6.1. Graph Processing-Related Prefetching
6.2. Pointer-Related Fetchers
6.3. HMC as an Accelerator
6.4. Graph Acceleration Architecture
7. Conclusion
Author Contributions
Funding
Conflicts of Interest
References
- Hybrid Memory Cube Consortium. Hybrid Memory Cube Specification 2.1; Hybrid Memory Cube: Claremont, CA, USA, 2014.
- Ainsworth, S.; Jones, T.M. Graph prefetching using data structure knowledge. In Proceedings of the 2016 International Conference on Supercomputing, Istanbul, Turkey, 1–3 June 2016; p. 39.
- Falsafi, B.; Wenisch, T.F. A primer on hardware prefetching. Synth. Lect. Comput. Archit. 2014, 9, 1–67.
- Tran, H.N.; Cambria, E. A survey of graph processing on graphics processing units. J. Supercomput. 2018, 74, 2086–2115.
- Cooksey, R.; Jourdan, S.; Grunwald, D. A stateless, content-directed data prefetching mechanism. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, USA, 5–9 October 2002; Volume 37, pp. 279–290.
- Malewicz, G.; Austern, M.H.; Bik, A.J.; Dehnert, J.C.; Horn, I.; Leiser, N.; Czajkowski, G. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, IN, USA, 6–10 June 2010; pp. 135–146.
- Low, Y.; Bickson, D.; Gonzalez, J.; Guestrin, C.; Kyrola, A.; Hellerstein, J.M. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proc. VLDB Endow. 2012, 5, 716–727.
- Xin, R.S.; Gonzalez, J.E.; Franklin, M.J.; Stoica, I. GraphX: A resilient distributed graph system on Spark. In Proceedings of the First International Workshop on Graph Data Management Experiences and Systems, New York, NY, USA, 22–27 June 2013; p. 2.
- Corbellini, A.; Mateos, C.; Godoy, D.; Zunino, A.; Schiaffino, S. An architecture and platform for developing distributed recommendation algorithms on large-scale social networks. J. Inf. Sci. 2015, 41, 686–704.
- Corbellini, A.; Godoy, D.; Mateos, C.; Schiaffino, S.; Zunino, A. DPM: A novel distributed large-scale social graph processing framework for link prediction algorithms. Future Gener. Comput. Syst. 2018, 78, 474–480.
- Roth, A.; Sohi, G.S. Effective jump-pointer prefetching for linked data structures. In Proceedings of the 26th International Symposium on Computer Architecture (ISCA), Atlanta, GA, USA, 1–4 May 1999; Volume 27, pp. 111–121.
- Al-Sukhni, H.; Bratt, I.; Connors, D.A. Compiler-directed content-aware prefetching for dynamic data structures. In Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques, New Orleans, LA, USA, 27 September–1 October 2003; pp. 91–100.
- Lai, S.C. Hardware-based pointer data prefetcher. In Proceedings of the 21st International Conference on Computer Design, San Jose, CA, USA, 13–15 October 2003; pp. 290–298.
- Yu, X.; Hughes, C.J.; Satish, N.; Devadas, S. IMP: Indirect memory prefetcher. In Proceedings of the 48th International Symposium on Microarchitecture, Waikiki, HI, USA, 5–9 December 2015; pp. 178–190.
- Nilakant, K.; Dalibard, V.; Roy, A.; Yoneki, E. PrefEdge: SSD prefetcher for large-scale graph traversal. In Proceedings of the International Conference on Systems and Storage, Santa Clara, CA, USA, 2–6 May 2014; pp. 1–12.
- Zhang, D.; Ma, X.; Thomson, M.; Chiou, D. Minnow: Lightweight Offload Engines for Worklist Management and Worklist-Directed Prefetching. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, Williamsburg, VA, USA, 24–28 March 2018; pp. 593–607.
- Kim, D.; Kung, J.; Chai, S.; Yalamanchili, S.; Mukhopadhyay, S. Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory. In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, Korea, 18–22 June 2016; pp. 380–392.
- Dai, G.; Huang, T.; Chi, Y.; Zhao, J.; Sun, G.; Liu, Y.; Wang, Y.; Xie, Y.; Yang, H. GraphH: A Processing-in-Memory Architecture for Large-scale Graph Processing. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2018, 20.
- Qian, C.; Childers, B.; Huang, L.; Yu, Q.; Wang, Z. HMCSP: Reducing Transaction Latency of CSR-based SPMV in Hybrid Memory Cube. In Proceedings of the 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Belfast, UK, 2–4 April 2018; pp. 114–116.
- Xu, C.; Wang, C.; Gong, L.; Lu, Y.; Sun, F.; Zhang, Y.; Li, X.; Zhou, X. OmniGraph: A Scalable Hardware Accelerator for Graph Processing. In Proceedings of the 2017 IEEE International Conference on Cluster Computing (CLUSTER), Honolulu, HI, USA, 5–8 September 2017; pp. 623–624.
- Song, L.; Zhuo, Y.; Qian, X.; Li, H.; Chen, Y. GraphR: Accelerating graph processing using ReRAM. In Proceedings of the 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), Vienna, Austria, 24–28 February 2018; pp. 531–543.
- Dogan, H.; Hijaz, F.; Ahmad, M.; Kahne, B.; Wilson, P.; Khan, O. Accelerating graph and machine learning workloads using a shared memory multicore architecture with auxiliary support for in-hardware explicit messaging. In Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Orlando, FL, USA, 29 May–2 June 2017; pp. 254–264.
- Ham, T.J.; Wu, L.; Sundaram, N.; Satish, N.; Martonosi, M. Graphicionado: A high-performance and energy-efficient accelerator for graph analytics. In Proceedings of the 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Taipei, Taiwan, 15–19 October 2016; pp. 1–13.
- D'Azevedo, E.F.; Fahey, M.R.; Mills, R.T. Vectorized sparse matrix multiply for compressed row storage format. In Proceedings of the International Conference on Computational Science, Atlanta, GA, USA, 22–25 May 2005; pp. 99–106.
- Leskovec, J.; Krevl, A. SNAP: A general-purpose network analysis and graph-mining library. ACM Trans. Intell. Syst. Technol. 2016, 8, 1.
- Cheng, C.; Riley, R.; Kumar, S.P.; Garcia-Luna-Aceves, J.J. A loop-free extended Bellman-Ford routing protocol without bouncing effect. ACM SIGCOMM Comput. Commun. Rev. 1989, 19, 224–236.
- Fanding, D. A Faster Algorithm for Shortest-Path-SPFA. J. Southw. Jiaotong Univ. 1994, 2, 207–212.
- Jeon, D.I.; Chung, K.S. CasHMC: A cycle-accurate simulator for Hybrid Memory Cube. IEEE Comput. Archit. Lett. 2017, 16, 10–13.
- Luk, C.K.; Cohn, R.; Muth, R.; Patil, H.; Klauser, A.; Lowney, G.; Wallace, S.; Reddi, V.J.; Hazelwood, K. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, Chicago, IL, USA, 12–15 June 2005; Volume 40, pp. 190–200.
- Wilton, S.J.; Jouppi, N.P. CACTI: An enhanced cache access and cycle time model. IEEE J. Solid-State Circuits 1996, 31, 677–688.
- Nai, L.; Xia, Y.; Tanase, I.G.; Kim, H.; Lin, C.Y. GraphBIG: Understanding graph computing in the context of industrial solutions. In Proceedings of the 2015 SC-International Conference for High Performance Computing, Networking, Storage and Analysis, Austin, TX, USA, 15–20 November 2015; pp. 1–12.
- Murphy, R.C.; Wheeler, K.B.; Barrett, B.W.; Ang, J.A. Introducing the Graph 500. Cray Users Group 2010, 19, 45–74.
- Lakshminarayana, N.B.; Kim, H. Spare register aware prefetching for graph algorithms on GPUs. In Proceedings of the 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), Orlando, FL, USA, 15–19 February 2014; pp. 614–625.
- Gries, D. Compiler Construction for Digital Computers; Wiley: New York, NY, USA, 1971; Volume 24.
- Muchnick, S. Advanced Compiler Design and Implementation; Morgan Kaufmann: Burlington, MA, USA, 1997.
- Ebrahimi, E.; Mutlu, O.; Patt, Y.N. Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems. In Proceedings of the IEEE 15th International Symposium on High Performance Computer Architecture, Raleigh, NC, USA, 14–18 February 2009; pp. 7–17.
- Nai, L.; Hadidi, R.; Sim, J.; Kim, H.; Kumar, P.; Kim, H. GraphPIM: Enabling instruction-level PIM offloading in graph computing frameworks. In Proceedings of the 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), Austin, TX, USA, 4–8 February 2017; pp. 457–468.
- Aguilera, P.; Zhang, D.P.; Kim, N.S.; Jayasena, N. Fine-Grained Task Migration for Graph Algorithms Using Processing in Memory. In Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, Chicago, IL, USA, 23–27 May 2016; pp. 489–498.
- Ahn, J.; Hong, S.; Yoo, S.; Mutlu, O.; Choi, K. A scalable processing-in-memory accelerator for parallel graph processing. ACM SIGARCH Comput. Archit. News 2016, 43, 105–117.
- Hong, B.; Kim, G.; Ahn, J.H.; Kwon, Y.; Kim, H.; Kim, J. Accelerating linked-list traversal through near-data processing. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, Haifa, Israel, 11–15 September 2016; pp. 113–124.
- Zhou, S.; Chelmis, C.; Prasanna, V.K. Accelerating large-scale single-source shortest path on FPGA. In Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW), Hyderabad, India, 25–29 May 2015; pp. 129–136.
- Attia, O.G.; Grieve, A.; Townsend, K.R.; Jones, P.; Zambreno, J. Accelerating all-pairs shortest path using a message-passing reconfigurable architecture. In Proceedings of the 2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig), Mayan Riviera, Mexico, 7–9 December 2015; pp. 1–6.
- Attia, O.G.; Townsend, K.R.; Jones, P.H.; Zambreno, J. A Reconfigurable Architecture for the Detection of Strongly Connected Components. ACM Trans. Reconfig. Technol. Syst. 2016, 9, 16.
- Attia, O.G.; Johnson, T.; Townsend, K.; Jones, P.; Zambreno, J. CyGraph: A reconfigurable architecture for parallel breadth-first search. In Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW), Phoenix, AZ, USA, 19–23 May 2014; pp. 228–235.
Workload | Vertex | Edge | Description
---|---|---|---
Wiki | 2,394,385 | 5,021,410 | Wikipedia talk (communication) network
CA_Road | 1,965,206 | 2,766,607 | Road network of California
YouTube | 1,134,890 | 2,987,624 | YouTube online social network
Email | 265,214 | 420,045 | Email network from an EU research institution
Google | 875,713 | 5,105,039 | Web graph from Google
Watson | 2,041,302 | 12,203,772 | Watson gene graph
Amazon | 262,111 | 1,234,877 | Amazon product co-purchasing network from 2 March 2003
DBLP | 317,080 | 1,049,866 | DBLP collaboration network
Knowledge | 138,612 | 1,394,826 | Knowledge graph
Operations | Event | Action
---|---|---
1. Operations in VEP | |
2. Operations in EP | |
3. Operations in VSP | |
Processor | 8-core, 2 GHz, in-order
---|---
Cache (for baseline) | L1 cache: 32 KB, 2-way; L2 cache: 2 MB, 4-way
Vault controller | close-page policy, 32-entry buffer, 16-entry command queue
Link | 4 SerDes links, 30 Gb/s lane speed, 480 GB/s max link bandwidth
On-chip cache | VEC: 16 KB, direct-mapped, latency: 0.15 ns, power: 5.9 mW, area: 0.03 mm²; EC, VSC: 64 KB, direct-mapped, latency: 0.3 ns, power: 21.1 mW, area: 0.07 mm² each
HMC | 32 TSVs, 2.5 Gb/s; timing: tCK = 0.8 ns, tRP = 10, tRCD = 13, tCL = 13, tRAS = 27, tWR = 10, tCCD = 4
On-chip buffer | VEB, VSB: 1 KB; EB, PB: 32 KB
Associativity | 16 KB | 32 KB | 64 KB | 128 KB
---|---|---|---|---
Direct-Mapped | 0.167 | 0.227 | 0.316 | 0.431
Four-Way Set Assoc | 0.420 | 0.454 | 0.464 | 0.523
Eight-Way Set Assoc | 0.753 | 0.779 | 0.812 | 0.868
Full Assoc | 0.304 | 0.573 | 0.709 | 1.191
Graph Case | s16e5 | s16e10 | s16e15 | s16e20 | s16e25 | Gmean |
---|---|---|---|---|---|---|
Speedup | 6.68 | 6.53 | 6.71 | 6.50 | 6.58 | 6.60 |
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Qian, C.; Childers, B.; Huang, L.; Guo, H.; Wang, Z. CGAcc: A Compressed Sparse Row Representation-Based BFS Graph Traversal Accelerator on Hybrid Memory Cube. Electronics 2018, 7, 307. https://doi.org/10.3390/electronics7110307