
Stitch: fusible heterogeneous accelerators enmeshed with many-core architecture for wearables

Authors:

Manupa Karunaratne,

Li-Shiuan Peh

ISCA '18: Proceedings of the 45th Annual International Symposium on Computer Architecture

Pages 575 - 587

https://doi.org/10.1109/ISCA.2018.00054

Published: 02 June 2018

Abstract

Wearable devices now leverage multi-core processors to meet the increasing computational demands of applications via multi-threading. However, the power and performance constraints of many wearable applications can only be satisfied when thread-level parallelism is coupled with hardware acceleration of common computational kernels. ASIC accelerators offer high performance per watt but suffer from high non-recurring engineering costs. Configurable accelerators that can be reused across applications present a promising alternative. Autonomous configurable accelerators that are loosely coupled to the processor occupy additional silicon area for local data and control, and incur data communication overhead. In contrast, configurable instruction set extension (ISE) accelerators that are tightly integrated into the processor pipeline eliminate such overheads by sharing the existing processor resources. Yet naively adding full-blown ISE accelerators to each core in a many-core architecture leads to large area and power overheads, which is infeasible in resource-constrained wearables. In this paper, we propose Stitch, a many-core architecture where tiny, heterogeneous, configurable and fusible ISE accelerators, called polymorphic patches, are effectively enmeshed with the cores. The novelty of our architecture lies in the ability to stitch together multiple polymorphic patches, each of which can handle only very simple ISEs, across the chip to create large, virtual accelerators that can execute complex ISEs. The virtual connections are realized efficiently with a very lightweight compiler-scheduled network-on-chip (NoC) with no buffers or control logic. Our evaluations across representative wearable applications show an average 2.3X improvement in runtime for Stitch compared to a baseline many-core processor without ISEs, at a modest area and power overhead.
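The idea of stitching patches over a compiler-scheduled, bufferless NoC can be illustrated with a toy model. The sketch below is not the paper's algorithm: the abstract does not describe the routing or allocation scheme, so dimension-ordered (XY) routing on a 2D mesh, exclusive ownership of each directed link, and the names `xy_route` and `stitch` are all assumptions made for illustration. It only shows why a network without buffers or arbitration needs conflicts to be ruled out ahead of time.

```python
from typing import List, Set, Tuple

Tile = Tuple[int, int]    # hypothetical (x, y) coordinate of a core and its patch
Link = Tuple[Tile, Tile]  # directed hop between neighbouring tiles


def xy_route(src: Tile, dst: Tile) -> List[Link]:
    """Dimension-ordered route (an assumption): move along x, then along y."""
    links: List[Link] = []
    x, y = src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def stitch(chain: List[Tile], reserved: Set[Link]) -> List[Link]:
    """Statically reserve the links joining consecutive patches in `chain`.

    With no buffers or arbitration in this toy network, two connections
    may never share a directed link, so a conflict is rejected up front
    instead of being resolved at run time.
    """
    wanted: List[Link] = []
    for src, dst in zip(chain, chain[1:]):
        wanted.extend(xy_route(src, dst))
    wanted_set = set(wanted)
    if len(wanted_set) != len(wanted) or reserved & wanted_set:
        raise ValueError("link conflict: schedule cannot be realised")
    reserved.update(wanted_set)
    return wanted
```

Calling `stitch([(0, 0), (2, 0), (2, 1)], reserved)` on an empty `reserved` set claims three hops. A later request that reuses one of those directed links raises `ValueError`, which stands in for the compiler choosing a different set of patches or a different route.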

Cited By

Suresh V, Mishra B, Jing Y, Zhu Z, Jin N, Block C, Mantovani P, Giri D, Zuckerman J, Carloni L, Adve S (2024). Mozart: Taming Taxes and Composing Accelerators with Shared-Memory. Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques, pp. 183-200. Online publication date: 14-Oct-2024
https://dl.acm.org/doi/10.1145/3656019.3676896
Tan C, Tambe T, Zhang J, Fang B, Geng T, Wei G, Brooks D, Tumeo A, Gopalakrishnan G, Li A, Rauchwerger L, Cameron K, Nikolopoulos D, Pnevmatikatos D (2022). ASAP. Proceedings of the 36th ACM International Conference on Supercomputing, pp. 1-13. Online publication date: 28-Jun-2022
https://dl.acm.org/doi/10.1145/3524059.3532359
Charitopoulos G, Pnevmatikatos D, Gaydadjiev G (2021). MC-DeF. ACM Transactions on Architecture and Code Optimization 18:3, pp. 1-25. Online publication date: 14-Apr-2021
https://dl.acm.org/doi/10.1145/3447970

Index Terms

Stitch: fusible heterogeneous accelerators enmeshed with many-core architecture for wearables
1. Computer systems organization
  1. Architectures
    1. Other architectures
  2. Embedded and cyber-physical systems
2. Hardware

Index terms have been assigned to the content through auto-classification.

Information & Contributors

Information

Published In

ISCA '18: Proceedings of the 45th Annual International Symposium on Computer Architecture

June 2018

884 pages

ISBN:9781538659847

Publisher

IEEE Press

Qualifiers

Research-article

Conference

ISCA '18

ISCA '18: The 45th Annual International Symposium on Computer Architecture

June 2 - 6, 2018

Los Angeles, California

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Bibliometrics

Total Citations: 5
Total Downloads: 158
Downloads (Last 12 months): 11
Downloads (Last 6 weeks): 2

Reflects downloads up to 30 Dec 2024

Citations

Cited By

Gobieski G, Atli A, Mai K, Lucia B, Beckmann N, Martínez J, Duato J, John L (2021). Snafu. Proceedings of the 48th Annual International Symposium on Computer Architecture, pp. 1027-1040. Online publication date: 14-Jun-2021
https://dl.acm.org/doi/10.1109/ISCA52012.2021.00084
Pal S, Feng S, Park D, Kim S, Amarnath A, Yang C, He X, Beaumont J, May K, Xiong Y, Kaszyk K, Morton J, Sun J, O'Boyle M, Cole M, Chakrabarti C, Blaauw D, Kim H, Mudge T, Dreslinski R, Sarkar V, Kim H (2020). Transmuter. Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, pp. 175-190. Online publication date: 30-Sep-2020
https://dl.acm.org/doi/10.1145/3410463.3414627
