research-article

A low-cost memory interface for high-throughput accelerators

Authors:

Chengyong Wu

CASES '14: Proceedings of the 2014 International Conference on Compilers, Architecture and Synthesis for Embedded Systems

Article No.: 11, Pages 1 - 10

https://doi.org/10.1145/2656106.2656109

Published: 12 October 2014

Abstract

Heterogeneous multi-cores, a mix of cores and accelerators, are becoming prevalent. These accelerators are designed for both speed and energy improvements, and thus they increasingly come with a large number of load/store ports to achieve a high degree of parallelism. However, beyond GPGPUs, accelerators such as ASICs and CGRAs are increasingly capable of accelerating computations with irregular control flow and memory accesses; as a result, such accelerators need to be connected to caches instead of scratchpads, and few studies focus on accelerator-to-cache interfaces. The main existing alternative is the Load/Store Queue (LSQ), traditionally used to connect superscalar processors to caches and memory, but in the context of accelerators, LSQs are overkill and could significantly reduce the area and power benefits of accelerators. Moreover, we show that they are not a good fit for accelerators connected to multi-banked caches.
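To make the multi-banked setting concrete, the following toy model shows how requests issued in program order can complete out of order once they are spread across cache banks. It is only an illustration of the ordering problem, not the interface proposed in the paper; the bank count, the word-interleaved mapping, the one-request-per-cycle issue rate, and the fixed service time are all assumptions chosen for the example.

```python
# Illustrative only: NOT the interface proposed in the paper.
# Hypothetical parameters: 4 word-interleaved banks, one request issued
# per cycle, each bank serves one request at a time in 3 cycles.
NUM_BANKS = 4
SERVICE_CYCLES = 3


def completion_order(addresses):
    """Return request indices in the order their bank responses complete."""
    bank_free_at = [0] * NUM_BANKS  # cycle at which each bank becomes idle
    done = []
    for issue_cycle, addr in enumerate(addresses):
        bank = addr % NUM_BANKS  # word-interleaved bank mapping
        start = max(issue_cycle, bank_free_at[bank])  # wait if the bank is busy
        finish = start + SERVICE_CYCLES
        bank_free_at[bank] = finish
        done.append((finish, issue_cycle))
    # Sort by completion cycle; ties are broken by issue order.
    return [idx for _, idx in sorted(done)]


# Requests 0, 1 and 2 all map to bank 0; request 3 maps to bank 1.
print(completion_order([0, 4, 8, 1]))
```

In this model, the last request targets an idle bank and finishes before the third request, which is still queued behind a bank conflict. Any interface that must present responses in request order has to absorb that kind of reordering.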

In this article, we propose a fast accelerator-to-cache interface with a moderate area and power footprint compared to LSQs, even for a large number of load/store ports. For that purpose, we introduce a set of low-overhead techniques for ensuring in-order delivery of requests to/from cache banks. For a fair comparison, we synthesize and lay out, at 65nm, both our interface and an LSQ specially adapted to accelerators. We find that our interface achieves on average 78% of the performance of an LSQ using only 16% of the area and 24% of the power.
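As a rough reading of the reported averages, the three percentages can be combined into performance-per-area and performance-per-power ratios relative to the LSQ baseline. The sketch below does only that arithmetic. The constant and function names are hypothetical, and dividing one average by another is an approximation, since the abstract does not give per-benchmark data.

```python
# Figures quoted in the abstract, relative to the LSQ baseline (= 1.0).
REL_PERFORMANCE = 0.78
REL_AREA = 0.16
REL_POWER = 0.24


def relative_efficiency(rel_perf, rel_cost):
    """Performance per unit cost, normalized so the LSQ baseline equals 1.0."""
    return rel_perf / rel_cost


# Approximation: a ratio of averages, not an average of per-benchmark ratios.
perf_per_area = relative_efficiency(REL_PERFORMANCE, REL_AREA)
perf_per_power = relative_efficiency(REL_PERFORMANCE, REL_POWER)

print(f"performance per unit area:  {perf_per_area:.3f}x the LSQ")
print(f"performance per unit power: {perf_per_power:.3f}x the LSQ")
```

Under these assumptions the interface delivers roughly 4.9 times the performance per unit area and about 3.25 times the performance per unit power of the LSQ, which is the trade-off the abstract argues for.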

Published In

CASES '14: Proceedings of the 2014 International Conference on Compilers, Architecture and Synthesis for Embedded Systems

October 2014

241 pages

ISBN:9781450330503

DOI:10.1145/2656106

General Chairs: Karam S. Chatha (Qualcomm Research), Rolf Ernst (TU Braunschweig, Germany)

Program Chairs: Anand Raghunathan (Purdue University), Ravishankar Iyer (Intel)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGBED: ACM Special Interest Group on Embedded Systems
SIGDA: ACM Special Interest Group on Design Automation
IEEE CAS
IEEE Council on Electronic Design Automation (CEDA)
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ESWEEK'14

ESWEEK'14: TENTH EMBEDDED SYSTEM WEEK

October 12 - 17, 2014

New Delhi, India

Acceptance Rates

Overall Acceptance Rate 52 of 230 submissions, 23%

Bibliometrics & Citations

Bibliometrics

Total Citations: 5
Total Downloads: 191
Downloads (Last 12 months): 7
Downloads (Last 6 weeks): 0

Reflects downloads up to 21 Jan 2025

Cited By

Bhosale S, Agarwal V (2021). Implementation of Special Load and Store Instruction for the RST Unit. 2021 8th International Conference on Signal Processing and Integrated Networks (SPIN), pp. 772-777. Online publication date: 26-Aug-2021.
https://doi.org/10.1109/SPIN52536.2021.9565995

Liu L, Yang C, Yin S, Wei S (2018). CDPM: Context-Directed Pattern Matching Prefetching to Improve Coarse-Grained Reconfigurable Array Performance. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37:6, pp. 1171-1184. Online publication date: Jun-2018.
https://doi.org/10.1109/TCAD.2017.2748026

Josipovic L, Brisk P, Ienne P (2017). An Out-of-Order Load-Store Queue for Spatial Computing. ACM Transactions on Embedded Computing Systems, 16:5s, pp. 1-19. Online publication date: 27-Sep-2017.
https://dl.acm.org/doi/10.1145/3126525

Josipovic L, Brisk P, Ienne P (2017). An Out-of-Order Load-Store Queue for Spatial Computing. 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 134-134. Online publication date: Apr-2017.
https://doi.org/10.1109/FCCM.2017.26

Cota E, Mantovani P, Di Guglielmo G, Carloni L (2015). An Analysis of Accelerator Coupling in Heterogeneous Architectures. Proceedings of the 52nd Annual Design Automation Conference, pp. 1-6. Online publication date: 7-Jun-2015.
https://dl.acm.org/doi/10.1145/2744769.2744794