DOI: 10.1109/ISCA.2018.00054
research-article

Stitch: fusible heterogeneous accelerators enmeshed with many-core architecture for wearables

Published: 02 June 2018

Abstract

Wearable devices now use multi-core processors to meet the increasing computational demands of applications via multi-threading. However, the power and performance constraints of many wearable applications can only be satisfied when thread-level parallelism is coupled with hardware acceleration of common computational kernels. ASIC accelerators offer high performance per watt but suffer from high non-recurring engineering costs. Configurable accelerators that can be reused across applications present a promising alternative. Autonomous configurable accelerators that are loosely coupled to the processor occupy additional silicon area for local data and control, and they incur data communication overhead. In contrast, configurable instruction set extension (ISE) accelerators tightly integrated into the processor pipeline eliminate such overheads by sharing the existing processor resources. Yet naively adding full-blown ISE accelerators to each core in a many-core architecture leads to huge area and power overheads, which is clearly infeasible in resource-constrained wearables. In this paper, we propose Stitch, a many-core architecture in which tiny, heterogeneous, configurable, and fusible ISE accelerators, called polymorphic patches, are enmeshed with the cores. The novelty of our architecture lies in the ability to stitch together multiple polymorphic patches across the chip, each of which can handle only very simple ISEs, to create large virtual accelerators that can execute complex ISEs. The virtual connections are realized efficiently with a very lightweight, compiler-scheduled network-on-chip (NoC) that has no buffers or control logic. Our evaluations across representative wearable applications show an average 2.3X improvement in runtime for Stitch compared to a baseline many-core processor without ISEs, at a modest area and power overhead.
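To make the stitching idea concrete, the following is a minimal toy sketch in Python and not the paper's design. It assumes a 2x2 mesh with one small patch per tile, and models a "complex ISE" as a chain of simple operations that is mapped onto patches ahead of time, the way a compiler-scheduled route would be fixed before execution. The names `Patch`, `stitch`, `total_hops`, and `run`, the patch kinds "mul" and "add", and the greedy nearest-patch placement are all assumptions made for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Callable, List, Tuple

Coord = Tuple[int, int]  # (row, col) of a tile in the mesh


@dataclass(frozen=True)
class Patch:
    """Toy stand-in for a polymorphic patch: one simple two-input operation."""
    pos: Coord
    kind: str                        # hypothetical label, e.g. "mul" or "add"
    fn: Callable[[int, int], int]


def manhattan(a: Coord, b: Coord) -> int:
    """Hop count between two tiles on a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def stitch(patches: List[Patch], chain: List[str], start: Coord) -> List[Patch]:
    """Ahead of time, pick one unused patch per step of the chain.

    Greedy nearest-patch choice is an assumption made for this sketch,
    not the paper's compiler algorithm.
    """
    free = list(patches)
    here = start
    plan: List[Patch] = []
    for kind in chain:
        candidates = [p for p in free if p.kind == kind]
        if not candidates:
            raise ValueError(f"no free patch of kind {kind!r}")
        # Break distance ties by position so the plan is deterministic.
        best = min(candidates, key=lambda p: (manhattan(here, p.pos), p.pos))
        free.remove(best)
        plan.append(best)
        here = best.pos
    return plan


def total_hops(plan: List[Patch], start: Coord) -> int:
    """Mesh hops the data travels along the fixed route."""
    hops, here = 0, start
    for patch in plan:
        hops += manhattan(here, patch.pos)
        here = patch.pos
    return hops


def run(plan: List[Patch], acc: int, operands: List[int]) -> int:
    """Execute the virtual accelerator: pass the running value through each patch."""
    for patch, x in zip(plan, operands):
        acc = patch.fn(acc, x)
    return acc


# A 2x2 mesh with one small patch per tile (hypothetical layout).
mesh = [
    Patch((0, 0), "mul", operator.mul),
    Patch((0, 1), "add", operator.add),
    Patch((1, 0), "mul", operator.mul),
    Patch((1, 1), "add", operator.add),
]

# A multiply-accumulate "complex ISE" built from two simple patches.
plan = stitch(mesh, ["mul", "add"], start=(0, 0))
print([p.pos for p in plan])        # tiles chosen for the virtual accelerator
print(total_hops(plan, (0, 0)))     # hops along the fixed route
print(run(plan, 3, [4, 5]))         # computes (3 * 4) + 5
```

Because the route is decided before execution in this sketch, nothing at runtime needs to arbitrate or queue data, which loosely mirrors the abstract's point about a NoC with no buffers or control logic. A real mapping would also have to account for link conflicts and timing, which this toy model ignores.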

Published In

ISCA '18: Proceedings of the 45th Annual International Symposium on Computer Architecture
June 2018
884 pages
ISBN: 9781538659847

Publisher

IEEE Press

Author Tags

1. accelerators
2. customization
3. low-power
4. manycore architectures
5. network-on-chip
6. wearables

Qualifiers

• Research-article

Conference

ISCA '18

Acceptance Rates

Overall Acceptance Rate: 543 of 3,203 submissions, 17%

Cited By

• (2024) Mozart: Taming Taxes and Composing Accelerators with Shared-Memory. Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques, pp. 183-200. DOI: 10.1145/3656019.3676896. Online publication date: 14-Oct-2024
• (2022) ASAP. Proceedings of the 36th ACM International Conference on Supercomputing, pp. 1-13. DOI: 10.1145/3524059.3532359. Online publication date: 28-Jun-2022
• (2021) MC-DeF. ACM Transactions on Architecture and Code Optimization 18:3, pp. 1-25. DOI: 10.1145/3447970. Online publication date: 14-Apr-2021
• (2021) Snafu. Proceedings of the 48th Annual International Symposium on Computer Architecture, pp. 1027-1040. DOI: 10.1109/ISCA52012.2021.00084. Online publication date: 14-Jun-2021
• (2020) Transmuter. Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, pp. 175-190. DOI: 10.1145/3410463.3414627. Online publication date: 30-Sep-2020
