research-article

CASCADE: High Throughput Data Streaming via Decoupled Access-Execute CGRA

Published: 07 October 2019

Abstract

A Coarse-Grained Reconfigurable Array (CGRA) is a promising high-performance, low-power accelerator for compute-intensive loop kernels. While mapping computations onto a CGRA is a well-studied problem, bringing data into the array at high throughput remains a challenge. A conventional CGRA design performs on-array computations to generate memory addresses for data access, which undermines the attainable throughput. A decoupled access-execute architecture, on the other hand, isolates memory access from the actual computations, resulting in significantly higher throughput.
We propose a novel decoupled access-execute CGRA design called CASCADE, with full architecture and compiler support for high-throughput data streaming from an on-chip multi-bank memory. CASCADE offloads the address computations for multi-bank data memory access to custom-designed programmable hardware. An end-to-end, fully automated compiler synchronizes the conflict-free movement of data between the memory banks and the CGRA. Experimental evaluations show, on average, a 3× performance benefit and a 2.2× performance-per-watt improvement for CASCADE compared to an iso-area conventional CGRA that has a bigger processing array in lieu of dedicated hardware memory address generation logic.
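To make the decoupled access-execute idea concrete, the following is a minimal software sketch, not CASCADE's actual hardware or compiler. It assumes affine streams (address = base + stride × i) and cyclic bank partitioning (bank = address mod N); the abstract specifies neither. All names (`AffineStream`, `access_unit`, `execute_unit`, and so on) are hypothetical. The access side generates addresses and fetches operands from the banks, the execute side only consumes operands, and a simple check reports whether two streams ever request the same bank in the same cycle.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence, Tuple


@dataclass(frozen=True)
class AffineStream:
    """Hypothetical stream descriptor: address = base + stride * i."""
    base: int
    stride: int
    length: int

    def addresses(self) -> Iterator[int]:
        for i in range(self.length):
            yield self.base + self.stride * i


def bank_and_offset(addr: int, num_banks: int) -> Tuple[int, int]:
    """Cyclic partitioning (an assumed scheme): bank = addr mod N, offset = addr div N."""
    return addr % num_banks, addr // num_banks


def stream_cycles(streams: Sequence[AffineStream],
                  num_banks: int) -> Iterator[List[Tuple[int, int]]]:
    """For each cycle, list the (bank, offset) request issued by every stream."""
    for addrs in zip(*(s.addresses() for s in streams)):
        yield [bank_and_offset(a, num_banks) for a in addrs]


def is_conflict_free(streams: Sequence[AffineStream], num_banks: int) -> bool:
    """True if no two streams ever request the same bank in the same cycle."""
    for requests in stream_cycles(streams, num_banks):
        banks = [bank for bank, _ in requests]
        if len(set(banks)) != len(banks):
            return False
    return True


def fill_banks(data: Sequence[int], num_banks: int) -> List[List[int]]:
    """Distribute a flat array over the banks using the same cyclic scheme."""
    banks: List[List[int]] = [[] for _ in range(num_banks)]
    for addr, value in enumerate(data):
        banks[addr % num_banks].append(value)
    return banks


def access_unit(banks: List[List[int]],
                streams: Sequence[AffineStream]) -> Iterator[Tuple[int, ...]]:
    """Access side: computes addresses and streams one operand tuple per cycle."""
    for requests in stream_cycles(streams, len(banks)):
        yield tuple(banks[bank][offset] for bank, offset in requests)


def execute_unit(operands: Iterator[Tuple[int, ...]]) -> List[int]:
    """Execute side: pure computation, with no address arithmetic at all."""
    return [sum(ops) for ops in operands]


if __name__ == "__main__":
    data = list(range(16))
    banks = fill_banks(data, num_banks=4)
    a = AffineStream(base=0, stride=1, length=8)   # reads data[i]
    b = AffineStream(base=1, stride=1, length=8)   # reads data[i + 1]
    print(is_conflict_free([a, b], num_banks=4))
    print(execute_unit(access_unit(banks, [a, b])))
```

In this sketch the streams `a` and `b` always fall in different banks, so both operands can be fetched in the same cycle. Shifting `b` to `base=4` would place both streams in the same bank every cycle, which is the kind of conflict a compiler for such an architecture has to avoid through data placement or scheduling.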

      Published In

      ACM Transactions on Embedded Computing Systems, Volume 18, Issue 5s
      Special Issue ESWEEK 2019, CASES 2019, CODES+ISSS 2019 and EMSOFT 2019
      October 2019
      1423 pages
      ISSN: 1539-9087
      EISSN: 1558-3465
      DOI: 10.1145/3365919
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 07 October 2019
      Accepted: 01 July 2019
      Revised: 01 June 2019
      Received: 01 April 2019
      Published in TECS Volume 18, Issue 5s

      Author Tags

      1. Coarse grained reconfigurable arrays
      2. decoupled access-execute architectures
      3. multi-bank memory partitioning

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Bibliometrics & Citations

      Article Metrics

      • Downloads (last 12 months): 69
      • Downloads (last 6 weeks): 7
      Reflects downloads up to 15 Jan 2025

      Cited By

      • (2024) Canalis: A Throughput-Optimized Framework for Real-Time Stream Processing of Wireless Communication. ACM Transactions on Reconfigurable Technology and Systems 17:4 (1-32). DOI: 10.1145/3695880. Online publication date: 18-Sep-2024
      • (2024) SAT-Based Exact Modulo Scheduling Mapping for Resource-Constrained CGRAs. ACM Journal on Emerging Technologies in Computing Systems 20:3 (1-26). DOI: 10.1145/3663675. Online publication date: 22-May-2024
      • (2024) R-Blocks: an Energy-Efficient, Flexible, and Programmable CGRA. ACM Transactions on Reconfigurable Technology and Systems 17:2 (1-34). DOI: 10.1145/3656642. Online publication date: 8-Apr-2024
      • (2024) E2EMap: End-to-End Reinforcement Learning for CGRA Compilation via Reverse Mapping. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (46-60). DOI: 10.1109/HPCA57654.2024.00015. Online publication date: 2-Mar-2024
      • (2024) DA-CGRA: Domain-Aware Heterogeneous Coarse-Grained Reconfigurable Architecture for the Edge. 2024 27th Euromicro Conference on Digital System Design (DSD) (410-417). DOI: 10.1109/DSD64264.2024.00061. Online publication date: 28-Aug-2024
      • (2024) Coarse-Grained Reconfigurable Array (CGRA). Handbook of Computer Architecture (465-505). DOI: 10.1007/978-981-97-9314-3_50. Online publication date: 21-Dec-2024
      • (2023) SAT-MapIt: A SAT-based Modulo Scheduling Mapper for Coarse Grain Reconfigurable Architectures. 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE) (1-6). DOI: 10.23919/DATE56975.2023.10137123. Online publication date: Apr-2023
      • (2023) Flip: Data-centric Edge CGRA Accelerator. ACM Transactions on Design Automation of Electronic Systems 29:1 (1-25). DOI: 10.1145/3631118. Online publication date: 18-Dec-2023
      • (2023) A Reschedulable Dataflow-SIMD Execution for Increased Utilization in CGRA Cross-Domain Acceleration. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42:3 (874-886). DOI: 10.1109/TCAD.2022.3185544. Online publication date: 1-Mar-2023
      • (2023) Efficient Data Streaming for a Tightly-Coupled Coarse-Grained Reconfigurable Array. 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (435-443). DOI: 10.1109/IPDPSW59300.2023.00075. Online publication date: May-2023
