[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Flip: Data-centric Edge CGRA Accelerator

Published: 18 December 2023 Publication History

Abstract

Coarse-Grained Reconfigurable Arrays (CGRA) are promising edge accelerators due to the outstanding balance in flexibility, performance, and energy efficiency. Classic CGRAs statically map compute operations onto the processing elements (PE) and route the data dependencies among the operations through the Network-on-Chip. However, CGRAs are designed for fine-grained static instruction-level parallelism and struggle to accelerate applications with dynamic and irregular data-level parallelism, such as graph processing. To address this limitation, we present Flip, a novel accelerator that enhances traditional CGRA architectures to boost the performance of graph applications. Flip retains the classic CGRA execution model while introducing a special data-centric mode for efficient graph processing. Specifically, it leverages the inherent data parallelism of graph algorithms by mapping graph vertices onto PEs rather than the operations and supporting dynamic routing of temporary data according to the runtime evolution of the graph frontier. Experimental results demonstrate that Flip achieves up to 36× speedup with merely 19% more area compared to classic CGRAs. Compared to state-of-the-art large-scale graph processors, Flip has similar energy efficiency and 2.2× better area efficiency at a much-reduced power/area budget.

References

[1]
Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. A scalable processing-in-memory accelerator for parallel graph processing. In Proceedings of the 42nd Annual International Symposium on Computer Architecture. 105–117.
[2]
Andrey Ayupov, Serif Yesil, Muhammet Mustafa Ozdal, Taemin Kim, Steven Burns, and Ozcan Ozturk. 2017. A template-based design methodology for graph-parallel hardware accelerators. IEEE Trans. Comput.-aid. Des. Integ. Circ. Syst. 37, 2 (2017), 420–430.
[3]
Thilini Kaushalya Bandara, Dhananjaya Wijerathne, Tulika Mitra, and Li-Shiuan Peh. 2022. REVAMP: A systematic framework for heterogeneous CGRA realization. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 918–932.
[4]
Maciej Besta, Raghavendra Kanakagiri, Grzegorz Kwasniewski, Rachata Ausavarungnirun, Jakub Beránek, Konstantinos Kanellopoulos, Kacper Janda, Zur Vonarburg-Shmaria, Lukas Gianinazzi, Ioana Stefan, Juan Gómez Luna, Jakub Golinowski, Marcin Copik, Lukas Kapp-Schwoerer, Salvatore Di Girolamo, Nils Blach, Marek Konieczny, Onur Mutlu, and Torsten Hoefler. 2021. SISA: Set-centric instruction set architecture for graph mining on processing-in-memory systems. In Proceedings of the 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’21). Association for Computing Machinery, New York, NY, 282–297. DOI:
[5]
Maciej Besta, Dimitri Stanojevic, Johannes De Fine Licht, Tal Ben-Nun, and Torsten Hoefler. 2019. Graph processing on FPGAs: Taxonomy, survey, challenges. arXiv preprint arXiv:1903.06697 (2019).
[6]
Mauro Bisson, Massimo Bernaschi, and Enrico Mastrostefano. 2016. Parallel distributed breadth first search on the Kepler architecture. IEEE Trans. Parallel Distrib. Syst. 27, 7 (2016), 2091–2102. DOI:
[7]
Aydin Buluc, Scott Beamer, Kamesh Madduri, Krste Asanovic, and David Patterson. 2017. Distributed-memory Breadth-first Search on Massive Graphs. arxiv:1705.04590 [cs.DC]
[8]
James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Gregory R. Ganger, Garth Gibson, Kimberly Keeton, and Eric Xing. 2013. Solving the straggler problem with bounded staleness. In Proceedings of the 14th Workshop on Hot Topics in Operating Systems (HotOS’13).
[9]
Vidushi Dadu, Sihao Liu, and Tony Nowatzki. 2021. Polygraph: Exposing the value of flexibility for graph processing accelerators. In Proceedings of the ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA’21). IEEE, 595–608.
[10]
Vidushi Dadu, Sihao Liu, and Tony Nowatzki. 2022. Systematically understanding graph accelerator dimensions and the value of hardware flexibility. IEEE Micro 42, 4 (2022), 87–96.
[11]
Guohao Dai, Tianhao Huang, Yuze Chi, Jishen Zhao, Guangyu Sun, Yongpan Liu, Yu Wang, Yuan Xie, and Huazhong Yang. 2019. GraphH: A processing-in-memory architecture for large-scale graph processing. IEEE Trans. Comput.-aid. Des. Integ. Circ. Syst. 38, 4 (2019), 640–653. DOI:
[12]
William James Dally and Brian Patrick Towles. 2004. Principles and Practices of Interconnection Networks. Elsevier.
[13]
Luke R. Everson, Sachin S. Sapatnekar, and Chris H. Kim. 2021. A time-based intra-memory computing graph processor featuring A* wavefront expansion and 2-D gradient control. IEEE J. Solid-state Circ. 56, 7 (2021), 2281–2290.
[14]
Graham Gobieski, Souradip Ghosh, Marijn Heule, Todd Mowry, Tony Nowatzki, Nathan Beckmann, and Brandon Lucia. 2022. A programmable, energy-minimal dataflow compiler and architecture. In Proceedings of the 55th IEEE/ACM International Symposium on Microarchitecture (MICRO’22). IEEE, 546–564.
[15]
Chuang-Yi Gui, Long Zheng, Bingsheng He, Cheng Liu, Xin-Yu Chen, Xiao-Fei Liao, and Hai Jin. 2019. A survey on graph processing accelerators: Challenges and opportunities. J. Comput. Sci. Technol. 34, 2 (2019), 339–371.
[16]
Matthew R. Guthaus, Jeffrey S. Ringenberg, Dan Ernst, Todd M. Austin, Trevor Mudge, and Richard B. Brown. 2001. MiBench: A free, commercially representative embedded benchmark suite. In Proceedings of the 4th Annual IEEE International Workshop on Workload Characterization (WWC’01). IEEE, 3–14.
[17]
Tae Jun Ham, Lisa Wu, Narayanan Sundaram, Nadathur Satish, and Margaret Martonosi. 2016. Graphicionado: A high-performance and energy-efficient accelerator for graph analytics. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’16). IEEE, 1–13.
[18]
Mahdi Hamzeh, Aviral Shrivastava, and Sarma Vrudhula. 2014. Branch-aware loop mapping on CGRAs. In Proceedings of the 51st Annual Design Automation Conference. 1–6.
[19]
Natalie Enright Jerger, Tushar Krishna, and Li-Shiuan Peh. 2017. On-chip networks. Synthes. Lect. Comput. Archit. 12, 3 (2017), 1–210.
[20]
Nachiket Kapre. 2015. Custom FPGA-based soft-processors for sparse graph acceleration. In Proceedings of the IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP’15). IEEE, 9–16.
[21]
Manupa Karunaratne, Aditi Kulkarni Mohite, Tulika Mitra, and Li-Shiuan Peh. 2017. HyCUBE: A CGRA with reconfigurable single-cycle multi-hop interconnect. In Proceedings of the 54th Annual Design Automation Conference 2017. 1–6.
[22]
Manupa Karunaratne, Dhananjaya Wijerathne, Tulika Mitra, and Li-Shiuan Peh. 2019. 4D-CGRA: Introducing branch dimension to spatio-temporal application mapping on CGRAs. In Proceedings of the IEEE/ACM International Conference on Computer-aided Design (ICCAD’19). IEEE, 1–8.
[23]
Guoqing Lei, Yong Dou, Rongchun Li, and Fei Xia. 2015. An FPGA implementation for solving the large single-source-shortest-path problem. IEEE Trans. Circ. Syst. II: Expr. Briefs 63, 5 (2015), 473–477.
[24]
Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data
[25]
Feifei Li, Dihan Cheng, Marios Hadjieleftheriou, George Kollios, and Shang-Hua Teng. 2005. On trip planning queries in spatial databases. In Proceedings of the International Symposium on Spatial and Temporal Databases. Springer, 273–290.
[26]
Zhaoying Li, Dhananjaya Wijerathne, Xianzhang Chen, Anuj Pathania, and Tulika Mitra. 2021. ChordMap: Automated mapping of streaming applications onto CGRA. IEEE Trans. Comput.-aid. Des. Integ. Circ. Syst. 41, 2 (2021), 306–319.
[27]
Zhaoying Li, D. Wijerathne, and Tulika Mitra. 2022. Coarse grained reconfigurable array CGRA. In Springer Handbook of Computer Architecture. Springer.
[28]
Zhaoying Li, Dan Wu, Dhananjaya Wijerathne, and Tulika Mitra. 2022. LISA: Graph neural network based portable mapping on spatial accelerators. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA’22). IEEE, 444–459.
[29]
Leibo Liu, Jianfeng Zhu, Zhaoshi Li, Yanan Lu, Yangdong Deng, Jie Han, Shouyi Yin, and Shaojun Wei. 2019. A survey of coarse-grained reconfigurable architecture and design: Taxonomy, challenges, and applications. ACM Comput. Surv. 52, 6 (2019), 1–39.
[30]
Sihao Liu, Jian Weng, Dylan Kupsh, Atefeh Sohrabizadeh, Zhengrong Wang, Licheng Guo, Jiuyang Liu, Maxim Zhulin, Rishabh Mani, Lucheng Zhang, Jason Cong, and Tony Nowatzki. 2022. OverGen: Improving FPGA usability through domain-specific overlay generation. In Proceedings of the 55th IEEE/ACM International Symposium on Microarchitecture (MICRO’22). IEEE, 35–56.
[31]
Yucheng Low, Joseph E. Gonzalez, Aapo Kyrola, Danny Bickson, Carlos E. Guestrin, and Joseph Hellerstein. 2014. GraphLab: A new framework for parallel machine learning. arXiv preprint arXiv:1408.2041 (2014).
[32]
Kevin J. M. Martin. 2022. Twenty years of automated methods for mapping applications on CGRA. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW’22). IEEE, 679–686.
[33]
Robert Ryan McCune, Tim Weninger, and Greg Madey. 2015. Thinking like a vertex: A survey of vertex-centric frameworks for large-scale distributed graph processing. ACM Comput. Surv. 48, 2 (2015), 1–39.
[34]
Mahim Mishra, Timothy J. Callahan, Tiberiu Chelcea, Girish Venkataramani, Seth C. Goldstein, and Mihai Budiu. 2006. Tartan: Evaluating spatial computation for whole program execution. ACM SIGARCH Comput. Archit. News 34, 5 (2006), 163–174.
[35]
Quan M. Nguyen and Daniel Sanchez. 2021. Fifer: Practical acceleration of irregular applications on reconfigurable architectures. In Proceedings of the 54th Annual IEEE/ACM International Symposium on Microarchitecture. 1064–1077.
[36]
Tony Nowatzki, Vinay Gangadhar, Newsha Ardalani, and Karthikeyan Sankaralingam. 2017. Stream-dataflow acceleration. In Proceedings of the ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA’17). IEEE, 416–429.
[37]
Muhammet Mustafa Ozdal, Serif Yesil, Taemin Kim, Andrey Ayupov, John Greth, Steven Burns, and Ozcan Ozturk. 2016. Energy efficient architecture for graph analytics accelerators. ACM SIGARCH Comput. Archit. News 44, 3 (2016), 166–177.
[38]
J. Thomas Pawlowski. 2011. Hybrid memory cube (HMC). In Proceedings of the IEEE Hot Chips Symposium (HCS’11). IEEE, 1–24.
[39]
Artur Podobas, Kentaro Sano, and Satoshi Matsuoka. 2020. A survey on coarse-grained reconfigurable architectures from a performance perspective. IEEE Access 8 (2020), 146719–146743.
[40]
Shafiur Rahman, Nael Abu-Ghazaleh, and Rajiv Gupta. 2020. GraphPulse: An event-driven hardware accelerator for asynchronous graph processing. In Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’20). IEEE, 908–921.
[41]
Steven M. Rubin and Raj Reddy. 1977. The locus model of search and its use in image interpretation. Cambridge, Massachusetts (1977), 590–595.
[42]
Linghao Song, Youwei Zhuo, Xuehai Qian, Hai Li, and Yiran Chen. 2018. GraphR: Accelerating graph processing using ReRAM. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA’18). IEEE, 531–543.
[43]
Jacob R. Stevens, Dipankar Das, Sasikanth Avancha, Bharat Kaul, and Anand Raghunathan. 2021. GNNerator: A hardware/software framework for accelerating graph neural networks. In Proceedings of the 58th ACM/IEEE Design Automation Conference (DAC’21). IEEE, 955–960.
[44]
Steven Swanson, Ken Michelson, Andrew Schwerin, and Mark Oskin. 2003. WaveScalar. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’03). IEEE, 291–302.
[45]
Cheng Tan, Nicolas Bohm Agostini, Jeff Zhang, Marco Minutoli, Vito Giovanni Castellana, Chenhao Xie, Tong Geng, Ang Li, Kevin Barker, and Antonino Tumeo. 2021. OpenCGRA: Democratizing coarse-grained reconfigurable arrays. In Proceedings of the IEEE 32nd International Conference on Application-Specific Systems, Architectures and Processors (ASAP’21). IEEE, 149–155.
[46]
Cheng Tan, Thierry Tambe, Jeff Zhang, Bo Fang, Tong Geng, Gu-Yeon Wei, David Brooks, Antonino Tumeo, Ganesh Gopalakrishnan, and Ang Li. 2022. ASAP: Automatic synthesis of area-efficient and precision-aware CGRAs. In Proceedings of the 36th ACM International Conference on Supercomputing. 1–13.
[47]
Cheng Tan, Chenhao Xie, Ang Li, Kevin J. Barker, and Antonino Tumeo. 2020. OpenCGRA: An open-source unified framework for modeling, testing, and evaluating CGRAs. In Proceedings of the IEEE 38th International Conference on Computer Design (ICCD’20). IEEE, 381–388.
[48]
Cheng Tan, Chenhao Xie, Ang Li, Kevin J. Barker, and Antonino Tumeo. 2021. Aurora: Automated refinement of coarse-grained reconfigurable accelerators. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE’21). IEEE, 1388–1393.
[49]
Dani Voitsechov and Yoav Etsion. 2014. Single-graph multiple flows: Energy efficient design alternative for GPGPUs. ACM SIGARCH Comput. Archit. News 42, 3 (2014), 205–216.
[50]
Elliot Waingold, Michael Taylor, Devabhaktuni Srikrishna, Vivek Sarkar, Walter Lee, Victor Lee, Jang Kim, Matthew Frank, Peter Finch, Rajeev Barua, Jonathan Babb, Saman Amarasinghe, and Anant Agarwal. 1997. Baring it all to software: Raw machines. Computer 30, 9 (1997), 86–93.
[51]
Bo Wang, Manupa Karunarathne, Aditi Kulkarni, Tulika Mitra, and Li-Shiuan Peh. 2019. HyCUBE: A 0.9 v 26.4 mops/mw, 290 pj/op, power efficient accelerator for IoT applications. In Proceedings of the IEEE Asian Solid-State Circuits Conference (A-SSCC’19). IEEE, 133–136.
[52]
Jian Weng, Sihao Liu, Vidushi Dadu, Zhengrong Wang, Preyas Shah, and Tony Nowatzki. 2020. DSAGEN: Synthesizing programmable spatial accelerators. In Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA’20). IEEE, 268–281.
[53]
Jian Weng, Sihao Liu, Zhengrong Wang, Vidushi Dadu, and Tony Nowatzki. 2020. A hybrid systolic-dataflow architecture for inductive matrix algorithms. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA’20). IEEE, 703–716.
[54]
Dhananjaya Wijerathne, Zhaoying Li, Thilini Kaushalya Bandara, and Tulika Mitra. 2022. PANORAMA: Divide-and-conquer approach for mapping complex loop kernels on CGRA. In Proceedings of the 59th ACM/IEEE Design Automation Conference. 127–132.
[55]
Dhananjaya Wijerathne, Zhaoying Li, Manupa Karunarathne, Anuj Pathania, and Tulika Mitra. 2019. Cascade: High throughput data streaming via decoupled access-execute CGRA. ACM Trans. Embed. Comput. Syst. 18, 5s (2019), 1–26.
[56]
Dhananjaya Wijerathne, Zhaoying Li, Manupa Karunaratne, Li-Shiuan Peh, and Tulika Mitra. 2022. Morpher: An open-source integrated compilation and simulation framework for CGRA. In Proceedings of the 5th Workshop on Open-Source EDA Technology (WOSET’22).
[57]
Dhananjaya Wijerathne, Zhaoying Li, Anuj Pathania, Tulika Mitra, and Lothar Thiele. 2021. HiMap: Fast and scalable high-quality mapping on CGRA via hierarchical abstraction. IEEE Trans. Comput.-aid. Des. Integ. Circ. Syst. 41, 10 (2021), 3290–3303.
[58]
H.-S. Philip Wong, Heng-Yuan Lee, Shimeng Yu, Yu-Sheng Chen, Yi Wu, Pang-Shiu Chen, Byoungil Lee, Frederick T. Chen, and Ming-Jinn Tsai. 2012. Metal–oxide RRAM. Proc. IEEE 100, 6 (2012), 1951–1970.
[59]
Pengcheng Yao, Long Zheng, Yu Huang, Qinggang Wang, Chuangyi Gui, Zhen Zeng, Xiaofei Liao, Hai Jin, and Jingling Xue. 2022. ScalaGraph: A scalable accelerator for massively parallel graph processing. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA’22). IEEE, 199–212.
[60]
Baofen Yuan, Jianfeng Zhu, Xingchen Man, Zijiao Ma, Shouyi Yin, Shaojun Wei, and Leibo Liu. 2021. Dynamic-II pipeline: Compiling loops with irregular branches on static-scheduling CGRA. IEEE Trans. Comput.-aid. Des. Integ. Circ. Syst. 41, 9 (2021), 2929–2942.
[61]
Mingxing Zhang, Youwei Zhuo, Chao Wang, Mingyu Gao, Yongwei Wu, Kang Chen, Christos Kozyrakis, and Xuehai Qian. 2018. GraphP: Reducing communication for PIM-based graph processing with efficient data partition. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA’18). IEEE, 544–557.
[62]
Yu Zhang, Xiaofei Liao, Hai Jin, Ligang He, Bingsheng He, Haikun Liu, and Lin Gu. 2021. DepGraph: A dependency-driven accelerator for efficient iterative graph processing. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA’21). IEEE, 371–384.
[63]
Jinhong Zhou, Shaoli Liu, Qi Guo, Xuda Zhou, Tian Zhi, Daofu Liu, Chao Wang, Xuehai Zhou, Yunji Chen, and Tianshi Chen. 2017. TuNao: A high-performance and energy-efficient reconfigurable accelerator for graph processing. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID’17). IEEE, 731–734.
[64]
Shijie Zhou, Rajgopal Kannan, Hanqing Zeng, and Viktor K. Prasanna. 2018. An FPGA framework for edge-centric graph processing. In Proceedings of the 15th ACM International Conference on Computing Frontiers. 69–77.
[65]
Youwei Zhuo, Chao Wang, Mingxing Zhang, Rui Wang, Dimin Niu, Yanzhi Wang, and Xuehai Qian. 2019. GraphQ: Scalable PIM-based graph processing. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’19). Association for Computing Machinery, New York, NY, 712–725. DOI:

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Design Automation of Electronic Systems
ACM Transactions on Design Automation of Electronic Systems  Volume 29, Issue 1
January 2024
521 pages
EISSN:1557-7309
DOI:10.1145/3613510
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

Publication History

Published: 18 December 2023
Online AM: 03 November 2023
Accepted: 25 October 2023
Revised: 24 October 2023
Received: 25 May 2023
Published in TODAES Volume 29, Issue 1

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Coarse-grained Reconfigurable Array
  2. Graph Processing
  3. Accelerator

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 465
    Total Downloads
  • Downloads (Last 12 months)414
  • Downloads (Last 6 weeks)35
Reflects downloads up to 13 Dec 2024

Other Metrics

Citations

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Full Text

View this article in Full Text.

Full Text

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media