
Invasive Tightly-Coupled Processor Arrays: A Domain-Specific Architecture/Compiler Co-Design Approach

Published: 01 April 2014

Abstract

We introduce a novel class of massively parallel processor architectures called invasive Tightly-Coupled Processor Arrays (TCPAs). The presented processor class is a highly parameterizable template that can be tailored before runtime to fulfill customers' requirements such as performance, area cost, and energy efficiency. These programmable accelerators are well suited for domain-specific computing in the areas of signal, image, and video processing as well as other streaming processing applications. To overcome future scaling issues (e.g., power consumption, reliability, resource management, as well as application parallelization and mapping), TCPAs are inherently designed in a way that supports self-adaptivity and resource awareness at the hardware level. Here, we follow a recently introduced resource-aware parallel computing paradigm called invasive computing, in which an application can dynamically claim, execute on, and release resources. Furthermore, we show how invasive computing can be used as an enabler for power management. For the first time, we present a seamless mapping flow for TCPAs, based on a domain-specific language. Moreover, we outline a complete symbolic mapping approach. Finally, we support our claims by comparing a TCPA against an ARM Mali-T604 GPU in terms of performance and energy efficiency.
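
The claim, execute, and release cycle named in the abstract can be illustrated with a small software model. The sketch below is only an illustration under stated assumptions: `ProcessorArray`, `invade`, and `run` are hypothetical names, the array is modeled as a set of free coordinates, and nothing here reflects the paper's hardware or its domain-specific language.

```python
from contextlib import contextmanager


class ProcessorArray:
    """Toy model (hypothetical) of a rows x cols array of processing elements (PEs)."""

    def __init__(self, rows, cols):
        # Every PE starts out free; a PE is identified by its (row, col) position.
        self.free = {(r, c) for r in range(rows) for c in range(cols)}

    def claim(self, wanted):
        """Claim up to `wanted` free PEs; fewer are granted if the array is busy."""
        granted = sorted(self.free)[:wanted]
        self.free.difference_update(granted)
        return granted

    def release(self, pes):
        """Return previously claimed PEs to the free pool."""
        self.free.update(pes)


@contextmanager
def invade(array, wanted):
    """Claim PEs for the duration of a block and always release them afterwards."""
    pes = array.claim(wanted)
    try:
        yield pes
    finally:
        array.release(pes)


def run(array, wanted, data):
    """Sum `data` on however many PEs were actually granted.

    Returns (total, number_of_pes_used), or None if no PE could be claimed.
    """
    with invade(array, wanted) as pes:
        if not pes:
            return None
        # Resource awareness: the work split adapts to the granted PE count.
        chunks = [data[i::len(pes)] for i in range(len(pes))]
        partial_sums = [sum(chunk) for chunk in chunks]  # one partial sum per PE
        return sum(partial_sums), len(pes)
```

In this model the application adapts its work distribution to the number of PEs it was granted rather than assuming a fixed allocation, and the context manager guarantees that claimed PEs return to the pool even if the computation raises an exception.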

Information

Published In

ACM Transactions on Embedded Computing Systems, Volume 13, Issue 4s
Special Issue on Real-Time and Embedded Technology and Applications, Domain-Specific Multicore Computing, Cross-Layer Dependable Embedded Systems, and Application of Concurrency to System Design (ACSD'13)
July 2014
571 pages
ISSN: 1539-9087
EISSN: 1558-3465
DOI: 10.1145/2601432
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 April 2014
Accepted: 01 September 2013
Revised: 01 June 2013
Received: 01 February 2013
Published in TECS Volume 13, Issue 4s

Author Tags

  1. Processor arrays
  2. code generation
  3. energy efficiency
  4. performance

Qualifiers

  • Research-article
  • Research
  • Refereed

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 36
  • Downloads (last 6 weeks): 5

Reflects downloads up to 06 Jan 2025

Cited By

  • (2024) ALPACA: An Accelerator Chip for Nested Loop Programs. 2024 IEEE International Symposium on Circuits and Systems (ISCAS), 1-5. DOI: 10.1109/ISCAS58744.2024.10558549. Online publication date: 19-May-2024
  • (2024) Analysis and Optimization of Block LU Decomposition for Execution on Tightly Coupled Processor Arrays. 2024 IEEE 35th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 97-106. DOI: 10.1109/ASAP61560.2024.00029. Online publication date: 24-Jul-2024
  • (2022) Precision- and Accuracy-Reconfigurable Processor Architectures—An Overview. IEEE Transactions on Circuits and Systems II: Express Briefs 69:6, 2661-2666. DOI: 10.1109/TCSII.2022.3173753. Online publication date: Jun-2022
  • (2022) Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey. IEEE Access 10, 131788-131828. DOI: 10.1109/ACCESS.2022.3229767. Online publication date: 2022
  • (2021) LION. Proceedings of the 19th ACM-IEEE International Conference on Formal Methods and Models for System Design, 32-43. DOI: 10.1145/3487212.3487349. Online publication date: 20-Nov-2021
  • (2021) Symbolic Loop Compilation for Tightly Coupled Processor Arrays. ACM Transactions on Embedded Computing Systems 20:5, 1-31. DOI: 10.1145/3466897. Online publication date: 29-Jul-2021
  • (2021) Skills Gaps in the Industry. ACM Transactions on Embedded Computing Systems 20:5, 1-39. DOI: 10.1145/3463340. Online publication date: 9-Jul-2021
  • (2021) Improving Power of DSP and CNN Hardware Accelerators Using Approximate Floating-point Multipliers. ACM Transactions on Embedded Computing Systems 20:5, 1-21. DOI: 10.1145/3448980. Online publication date: 9-Jul-2021
  • (2021) A Survey on Edge Performance Benchmarking. ACM Computing Surveys 54:3, 1-33. DOI: 10.1145/3444692. Online publication date: 22-Apr-2021
  • (2021) Hand Sign Recognition via Deep Learning on Tightly Coupled Processor Arrays. 2021 31st International Conference on Field-Programmable Logic and Applications (FPL), 388-388. DOI: 10.1109/FPL53798.2021.00079. Online publication date: Aug-2021
