research-article

CASCADE: High Throughput Data Streaming via Decoupled Access-Execute CGRA

Authors:

Dhananjaya Wijerathne,

Manupa Karunarathne,

Tulika Mitra

ACM Transactions on Embedded Computing Systems (TECS), Volume 18, Issue 5s

Article No.: 50, Pages 1 - 26

https://doi.org/10.1145/3358177

Published: 07 October 2019

Abstract

A Coarse-Grained Reconfigurable Array (CGRA) is a promising high-performance, low-power accelerator for compute-intensive loop kernels. While the mapping of computations onto the CGRA is a well-studied problem, bringing data into the array at high throughput remains a challenge. A conventional CGRA design performs on-array computations to generate memory addresses for data access, which undermines the attainable throughput. A decoupled access-execute architecture, on the other hand, isolates memory access from the actual computations, resulting in significantly higher throughput.

We propose a novel decoupled access-execute CGRA design called CASCADE with full architecture and compiler support for high-throughput data streaming from an on-chip multi-bank memory. CASCADE offloads the address computations for multi-bank data memory access to custom-designed programmable hardware. An end-to-end, fully automated compiler synchronizes the conflict-free movement of data between the memory banks and the CGRA. Experimental evaluations show, on average, a 3× performance benefit and a 2.2× performance-per-watt improvement for CASCADE compared to an iso-area conventional CGRA that uses a bigger processing array in lieu of dedicated hardware memory address generation logic.
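The decoupling idea in the abstract can be illustrated with a small software sketch. This is not CASCADE's hardware or compiler; it is a minimal model assuming an affine (base plus stride) access pattern and simple cyclic bank interleaving. The bank count `NUM_BANKS`, the function names, and the memory layout are all hypothetical and chosen only for illustration. The access side generates addresses, while the execute side only consumes operand values.

```python
from typing import Iterator, List, Tuple

NUM_BANKS = 4  # hypothetical bank count, not taken from the paper


def address_stream(base: int, stride: int, count: int) -> Iterator[int]:
    """Access side: emit addresses for an affine (base + i*stride) pattern."""
    for i in range(count):
        yield base + i * stride


def bank_of(addr: int) -> Tuple[int, int]:
    """Map a flat address to (bank, offset) under cyclic interleaving (assumed)."""
    return addr % NUM_BANKS, addr // NUM_BANKS


def conflict_free(addrs: List[int]) -> bool:
    """True if the addresses issued in one cycle all fall in different banks."""
    banks = [bank_of(a)[0] for a in addrs]
    return len(set(banks)) == len(banks)


def dot_product(memory: List[int], n: int) -> int:
    """Execute side: multiplies and accumulates; never computes an address."""
    a_addrs = address_stream(base=0, stride=2, count=n)  # a[i] at even addresses
    b_addrs = address_stream(base=1, stride=2, count=n)  # b[i] at odd addresses
    total = 0
    for a_addr, b_addr in zip(a_addrs, b_addrs):
        # Both operands are fetched in the same "cycle", so they must not
        # collide on a bank.
        assert conflict_free([a_addr, b_addr])
        total += memory[a_addr] * memory[b_addr]
    return total


memory = [1, 10, 2, 20, 3, 30, 4, 40]  # a and b interleaved (illustrative data)
result = dot_product(memory, 4)
```

In this sketch the address arithmetic lives entirely in `address_stream`, so the compute loop spends no operations on it. That is the property the abstract attributes to decoupled access-execute designs in general.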

Index Terms

CASCADE: High Throughput Data Streaming via Decoupled Access-Execute CGRA
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems
      2. Reconfigurable computing

Information & Contributors

Information

Published In

ACM Transactions on Embedded Computing Systems Volume 18, Issue 5s

Special Issue ESWEEK 2019, CASES 2019, CODES+ISSS 2019 and EMSOFT 2019

October 2019

1423 pages

ISSN:1539-9087

EISSN:1558-3465

DOI:10.1145/3365919

Editor:
Sandeep K. Shukla
Indian Institute of Technology, India

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 07 October 2019

Accepted: 01 July 2019

Revised: 01 June 2019

Received: 01 April 2019

Published in TECS Volume 18, Issue 5s

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Research Foundation Singapore
Huawei International Pte. Ltd.

Bibliometrics & Citations

Article Metrics

Total Citations: 25
Total Downloads: 562
Downloads (Last 12 months): 69
Downloads (Last 6 weeks): 7

Reflects downloads up to 15 Jan 2025

Citations

Cited By

Chen K, Mason Nelson T, Khadem A, Fayazi M, Singapuram S, Dreslinski R, Talati N, Kim H, Blaauw D. (2024). Canalis: A Throughput-Optimized Framework for Real-Time Stream Processing of Wireless Communication. ACM Transactions on Reconfigurable Technology and Systems 17:4, 1-32. doi:10.1145/3695880. Online publication date: 18-Sep-2024
https://dl.acm.org/doi/10.1145/3695880
Tirelli C, Sapriza J, Rodríguez Álvarez R, Ferretti L, Denkinger B, Ansaloni G, Miranda Calero J, Atienza D, Pozzi L. (2024). SAT-Based Exact Modulo Scheduling Mapping for Resource-Constrained CGRAs. ACM Journal on Emerging Technologies in Computing Systems 20:3, 1-26. doi:10.1145/3663675. Online publication date: 22-May-2024
https://dl.acm.org/doi/10.1145/3663675
de Bruin B, Vadivel K, Wijtvliet M, Jääskeläinen P, Corporaal H. (2024). R-Blocks: an Energy-Efficient, Flexible, and Programmable CGRA. ACM Transactions on Reconfigurable Technology and Systems 17:2, 1-34. doi:10.1145/3656642. Online publication date: 8-Apr-2024
https://dl.acm.org/doi/10.1145/3656642
Liu D, Xia Y, Shang J, Zhong J, Ouyang P, Yin S. (2024). E2EMap: End-to-End Reinforcement Learning for CGRA Compilation via Reverse Mapping. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 46-60. doi:10.1109/HPCA57654.2024.00015. Online publication date: 2-Mar-2024
https://doi.org/10.1109/HPCA57654.2024.00015
Aliagha E, Charaf N, Venkatesan N, Göhringer D. (2024). DA-CGRA: Domain-Aware Heterogeneous Coarse-Grained Reconfigurable Architecture for the Edge. 2024 27th Euromicro Conference on Digital System Design (DSD), 410-417. doi:10.1109/DSD64264.2024.00061. Online publication date: 28-Aug-2024
https://doi.org/10.1109/DSD64264.2024.00061
Li Z, Wijerathne D, Mitra T. (2024). Coarse-Grained Reconfigurable Array (CGRA). Handbook of Computer Architecture, 465-505. doi:10.1007/978-981-97-9314-3_50. Online publication date: 21-Dec-2024
https://doi.org/10.1007/978-981-97-9314-3_50
Tirelli C, Ferretti L, Pozzi L. (2023). SAT-MapIt: A SAT-based Modulo Scheduling Mapper for Coarse Grain Reconfigurable Architectures. 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE), 1-6. doi:10.23919/DATE56975.2023.10137123. Online publication date: Apr-2023
https://doi.org/10.23919/DATE56975.2023.10137123
Wu D, Chen P, Bandara T, Li Z, Mitra T. (2023). Flip: Data-centric Edge CGRA Accelerator. ACM Transactions on Design Automation of Electronic Systems 29:1, 1-25. doi:10.1145/3631118. Online publication date: 18-Dec-2023
https://dl.acm.org/doi/10.1145/3631118
Yin C, Jing N, Jiang J, Wang Q, Mao Z. (2023). A Reschedulable Dataflow-SIMD Execution for Increased Utilization in CGRA Cross-Domain Acceleration. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42:3, 874-886. doi:10.1109/TCAD.2022.3185544. Online publication date: 1-Mar-2023
https://dl.acm.org/doi/10.1109/TCAD.2022.3185544
Abbaszadeh M, Abdelrahman T, Azimi R, Czajkowski T, Goudarzi M. (2023). Efficient Data Streaming for a Tightly-Coupled Coarse-Grained Reconfigurable Array. 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 435-443. doi:10.1109/IPDPSW59300.2023.00075. Online publication date: May-2023
https://doi.org/10.1109/IPDPSW59300.2023.00075
