Invasive Tightly-Coupled Processor Arrays: A Domain-Specific Architecture/Compiler Co-Design Approach

Authors:

Srinivas Boppu,

Alexandru Tanase,

Oliver Reiche

ACM Transactions on Embedded Computing Systems (TECS), Volume 13, Issue 4s

Article No.: 133, Pages 1 - 29

https://doi.org/10.1145/2584660

Published: 01 April 2014

Abstract

We introduce a novel class of massively parallel processor architectures called invasive Tightly-Coupled Processor Arrays (TCPAs). The presented processor class is a highly parameterizable template that can be tailored before runtime to fulfill customers' requirements such as performance, area cost, and energy efficiency. These programmable accelerators are well suited for domain-specific computing in the areas of signal, image, and video processing, as well as other streaming applications. To overcome future scaling issues (e.g., power consumption, reliability, resource management, as well as application parallelization and mapping), TCPAs are inherently designed in a way that supports self-adaptivity and resource awareness at the hardware level. Here, we follow a recently introduced resource-aware parallel computing paradigm called invasive computing, in which an application can dynamically claim, execute on, and release resources. Furthermore, we show how invasive computing can be used as an enabler for power management. For the first time, we present a seamless mapping flow for TCPAs, based on a domain-specific language. Moreover, we outline a complete symbolic mapping approach. Finally, we support our claims by comparing a TCPA against an ARM Mali-T604 GPU in terms of performance and energy efficiency.
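
The claim, execute, and release cycle named in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions: the `ProcessorArray` class, its `claim` and `release` methods, and the `execute` helper are hypothetical names chosen for illustration, not the paper's API or hardware interface, and the placeholder workload simply squares numbers.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessorArray:
    """Toy model of a rows x cols array of processing elements (PEs)."""
    rows: int
    cols: int
    owner: dict = field(default_factory=dict)  # (row, col) -> application name

    def claim(self, app, wanted):
        """Claim up to `wanted` free PEs for `app`; return the PEs granted."""
        free = [(r, c) for r in range(self.rows) for c in range(self.cols)
                if (r, c) not in self.owner]
        granted = free[:wanted]  # may be fewer than requested
        for pe in granted:
            self.owner[pe] = app
        return granted

    def release(self, app):
        """Release every PE held by `app`; return how many were freed."""
        held = [pe for pe, a in self.owner.items() if a == app]
        for pe in held:
            del self.owner[pe]
        return len(held)


def execute(pes, data):
    """Split `data` round-robin over the claimed PEs and square each item."""
    if not pes:
        raise ValueError("no processing elements were claimed")
    chunks = [data[i::len(pes)] for i in range(len(pes))]
    return {pe: [x * x for x in chunk] for pe, chunk in zip(pes, chunks)}


array = ProcessorArray(rows=2, cols=2)
pes_a = array.claim("filter", wanted=3)   # granted 3 of the 4 PEs
pes_b = array.claim("edge", wanted=3)     # only 1 PE is still free
result = execute(pes_a, list(range(6)))   # work is spread over the claimed PEs
freed = array.release("filter")           # PEs become available again
```

In this toy model the second application receives fewer elements than it asked for, which is one way to picture resource awareness: the application adapts to what it was actually granted rather than assuming a fixed allocation.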

Index Terms

1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Reconfigurable computing
      2. Self-organizing autonomic computing
    2. Parallel architectures
      1. Systolic arrays
2. Software and its engineering
  1. Software notations and tools
    1. Context specific languages
      1. Specialized application languages

Published In

ACM Transactions on Embedded Computing Systems Volume 13, Issue 4s

Special Issue on Real-Time and Embedded Technology and Applications, Domain-Specific Multicore Computing, Cross-Layer Dependable Embedded Systems, and Application of Concurrency to System Design (ACSD'13)

July 2014

571 pages

ISSN:1539-9087

EISSN:1558-3465

DOI:10.1145/2601432

Editors:
Sandeep K. Shukla (Virginia Tech, USA)
Josep Carmona (Universitat Politècnica de Catalunya, Spain)
Mihai Teodor Lazarescu (Politecnico di Torino, Italy)
Marta Pietkiewicz-Koutny (Newcastle University, UK)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 01 April 2014

Accepted: 01 September 2013

Revised: 01 June 2013

Received: 01 February 2013

Published in TECS Volume 13, Issue 4s

Qualifiers

Research-article
Research
Refereed

Funding Sources

Research Training Group 1773 “Heterogeneous Image Systems”
Deutsche Forschungsgemeinschaft

Bibliometrics

Total Citations: 65
Total Downloads: 682
Downloads (Last 12 months): 36
Downloads (Last 6 weeks): 5

Reflects downloads up to 06 Jan 2025

Citations

Cited By

Walter D, Brand M, Heidorn C, Witterauf M, Hannig F, Teich J (2024). ALPACA: An Accelerator Chip for Nested Loop Programs. 2024 IEEE International Symposium on Circuits and Systems (ISCAS), 1-5. Online publication date: 19-May-2024. https://doi.org/10.1109/ISCAS58744.2024.10558549
Walter D, Adamtschuk T, Hannig F, Teich J (2024). Analysis and Optimization of Block LU Decomposition for Execution on Tightly Coupled Processor Arrays. 2024 IEEE 35th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 97-106. Online publication date: 24-Jul-2024. https://doi.org/10.1109/ASAP61560.2024.00029
Brand M, Hannig F, Keszocze O, Teich J (2022). Precision- and Accuracy-Reconfigurable Processor Architectures—An Overview. IEEE Transactions on Circuits and Systems II: Express Briefs 69:6, 2661-2666. Online publication date: Jun-2022. https://doi.org/10.1109/TCSII.2022.3173753
Dhilleswararao P, Boppu S, Manikandan M, Cenkeramaddi L (2022). Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey. IEEE Access 10, 131788-131828. Online publication date: 2022. https://doi.org/10.1109/ACCESS.2022.3229767
Walter D, Teich J, Arun-Kumar S, Mery D, Saha I, Zhang L (2021). LION. Proceedings of the 19th ACM-IEEE International Conference on Formal Methods and Models for System Design, 32-43. Online publication date: 20-Nov-2021. https://dl.acm.org/doi/10.1145/3487212.3487349
Witterauf M, Walter D, Hannig F, Teich J (2021). Symbolic Loop Compilation for Tightly Coupled Processor Arrays. ACM Transactions on Embedded Computing Systems 20:5, 1-31. Online publication date: 29-Jul-2021. https://dl.acm.org/doi/10.1145/3466897
Akdur D (2021). Skills Gaps in the Industry. ACM Transactions on Embedded Computing Systems 20:5, 1-39. Online publication date: 9-Jul-2021. https://dl.acm.org/doi/10.1145/3463340
Leon V, Paparouni T, Petrongonas E, Soudris D, Pekmestzi K (2021). Improving Power of DSP and CNN Hardware Accelerators Using Approximate Floating-point Multipliers. ACM Transactions on Embedded Computing Systems 20:5, 1-21. Online publication date: 9-Jul-2021. https://dl.acm.org/doi/10.1145/3448980
Varghese B, Wang N, Bermbach D, Hong C, Lara E, Shi W, Stewart C (2021). A Survey on Edge Performance Benchmarking. ACM Computing Surveys 54:3, 1-33. Online publication date: 22-Apr-2021. https://dl.acm.org/doi/10.1145/3444692
Heidorn C, Walter D, Candir Y, Hannig F, Teich J (2021). Hand Sign Recognition via Deep Learning on Tightly Coupled Processor Arrays. 2021 31st International Conference on Field-Programmable Logic and Applications (FPL), 388-388. Online publication date: Aug-2021. https://doi.org/10.1109/FPL53798.2021.00079
