PHI: Architectural Support for Synchronization- and Bandwidth-Efficient Commutative Scatter Updates

Authors:

Anurag Mukkara,

Nathan Beckmann,

Daniel Sanchez

MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture

Pages 1009 - 1022

https://doi.org/10.1145/3352460.3358254

Published: 12 October 2019

Abstract

Many applications perform frequent scatter update operations to large data structures. For example, in push-style graph algorithms, processing each vertex requires updating the data of all its neighbors. Neighbors are often scattered over the whole graph, so these scatter updates have poor spatial and temporal locality. In current systems, scatter updates suffer high synchronization costs and high memory traffic. These drawbacks make push-style execution unattractive, and, when algorithms allow it, programmers gravitate towards pull-style implementations based on gather reads instead.
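The contrast between push-style scatter updates and pull-style gather reads can be illustrated with a minimal sketch. The tiny graph, the `contrib` values, and the function names below are hypothetical and chosen only for illustration; they are not taken from the paper. Both loops compute the same per-vertex sums, but the push version writes to scattered destinations (and would need atomic adds if run in parallel), while the pull version only reads scattered sources.

```python
from collections import defaultdict


def push_step(out_edges, contrib, num_vertices):
    """Push style: every source scatters its contribution to its out-neighbors."""
    acc = [0.0] * num_vertices
    for src, dsts in out_edges.items():
        for dst in dsts:
            # Scatter update: many sources may write the same acc[dst],
            # so a parallel version would need an atomic add here.
            acc[dst] += contrib[src]
    return acc


def pull_step(in_edges, contrib, num_vertices):
    """Pull style: every destination gathers contributions from its in-neighbors."""
    acc = [0.0] * num_vertices
    for dst in range(num_vertices):
        for src in in_edges.get(dst, []):
            # Gather read: only this iteration writes acc[dst],
            # so no synchronization is needed.
            acc[dst] += contrib[src]
    return acc


def transpose(out_edges):
    """Build the in-edge lists that the pull version needs."""
    in_edges = defaultdict(list)
    for src, dsts in out_edges.items():
        for dst in dsts:
            in_edges[dst].append(src)
    return dict(in_edges)


# Hypothetical 4-vertex graph and per-vertex contributions.
OUT_EDGES = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}
CONTRIB = [1.0, 2.0, 4.0, 8.0]

if __name__ == "__main__":
    print(push_step(OUT_EDGES, CONTRIB, 4))
    print(pull_step(transpose(OUT_EDGES), CONTRIB, 4))
```

Because addition is commutative, the order in which the push version applies its updates does not affect the result, which is the property the abstract relies on.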

We present PHI, a push cache hierarchy that makes scatter updates synchronization- and bandwidth-efficient. PHI adds support for pushing sparse, commutative updates from cores towards main memory. PHI adds simple compute logic at each cache level to buffer and coalesce these commutative updates throughout the hierarchy. This avoids synchronization, exploits temporal locality, and produces a load-balanced execution. Moreover, PHI exploits spatial locality by selectively deferring updates with poor spatial locality, batching them to achieve sequential main memory transfers.
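The two ideas in this paragraph, coalescing commutative updates in a buffer and batching deferred updates by address region, can be sketched in software. This is only an analogy under stated assumptions, not PHI's hardware design: the class names, the FIFO eviction policy, the buffer capacity, and the bin size are all hypothetical, since the abstract does not specify them.

```python
class BinnedMemory:
    """Stand-in for main memory that defers updates into per-region bins."""

    def __init__(self, size, bin_size):
        self.data = [0] * size
        self.bin_size = bin_size  # assumed region size, in elements
        self.bins = {}            # region index -> list of (addr, delta)
        self.received = 0         # number of updates that reached this level

    def defer(self, addr, delta):
        # Append to the bin for this address region instead of writing at once.
        self.bins.setdefault(addr // self.bin_size, []).append((addr, delta))
        self.received += 1

    def apply_bins(self):
        # Apply one region at a time so accesses to each region are grouped.
        for region in sorted(self.bins):
            for addr, delta in self.bins[region]:
                self.data[addr] += delta
        self.bins.clear()


class CoalescingBuffer:
    """Stand-in for a cache level that merges commutative add updates."""

    def __init__(self, capacity, lower):
        self.capacity = capacity
        self.lower = lower   # callable(addr, delta) for the next level down
        self.pending = {}    # addr -> accumulated delta (insertion ordered)

    def update(self, addr, delta):
        if addr in self.pending:
            self.pending[addr] += delta  # coalesce with a buffered update
            return
        if len(self.pending) >= self.capacity:
            # Assumed policy: evict the oldest entry and push it downward.
            victim = next(iter(self.pending))
            self.lower(victim, self.pending.pop(victim))
        self.pending[addr] = delta

    def flush(self):
        for addr, delta in self.pending.items():
            self.lower(addr, delta)
        self.pending.clear()


def run_demo():
    memory = BinnedMemory(size=12, bin_size=4)
    buffer = CoalescingBuffer(capacity=2, lower=memory.defer)
    for addr, delta in [(5, 1), (5, 1), (9, 2), (5, 1), (1, 3), (9, 2)]:
        buffer.update(addr, delta)
    buffer.flush()
    memory.apply_bins()
    return memory.data, memory.received


if __name__ == "__main__":
    print(run_demo())
```

In this toy run, six updates are issued but only three reach the memory stand-in, because repeated updates to the same address are merged before they are pushed down. The final values match what applying all six updates directly would produce, since the adds commute.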

PHI is the first system to leverage both the temporal and spatial locality benefits of commutative scatter updates, some of which do not apply to gather reads. As a result, PHI not only makes push algorithms efficient, but makes them consistently faster than pull ones. We evaluate PHI on graph algorithms and other sparse applications processing large inputs. PHI improves performance by 4.7× on average (and by up to 11×), and reduces memory traffic by 2× (and by up to 5×).

Index Terms

PHI: Architectural Support for Synchronization- and Bandwidth-Efficient Commutative Scatter Updates
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multicore architectures

Information & Contributors

Information

Published In

MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture

October 2019

1104 pages

ISBN: 9781450369381

DOI: 10.1145/3352460

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

MICRO '52

Sponsor:

SIGMICRO

MICRO '52: The 52nd Annual IEEE/ACM International Symposium on Microarchitecture

October 12 - 16, 2019

Columbus, OH, USA

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%
