DOI: 10.1145/2656106.2656109
research-article

A low-cost memory interface for high-throughput accelerators

Published: 12 October 2014

Abstract

Heterogeneous multi-cores, which mix cores and accelerators, are becoming prevalent. These accelerators are designed for both speed and energy improvements, and thus they increasingly come with a large number of load/store ports to achieve a high degree of parallelism. However, beyond GPGPUs, accelerators such as ASICs and CGRAs are increasingly capable of accelerating computations with irregular control flow and memory accesses; as a result, such accelerators need to be connected to caches instead of scratchpads, yet few studies focus on accelerator-to-cache interfaces. The main existing alternative is the Load/Store Queue (LSQ), traditionally used to connect superscalar processors to caches and memory, but in the context of accelerators, LSQs are overkill and could significantly reduce the area and power benefits of accelerators. Moreover, we show that they are poorly suited to accelerators connected to multi-banked caches.
In this article, we propose a fast accelerator-to-cache interface with a moderate area and power footprint compared to LSQs, even for a large number of load/store ports. To that end, we introduce a set of low-overhead techniques for ensuring in-order delivery of requests to and from cache banks. For a fair comparison, we synthesize and lay out at 65 nm both our interface and an LSQ specially adapted to accelerators. We find that our interface achieves on average 78% of the performance of an LSQ using only 16% of the area and 24% of the power.
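
The abstract does not describe the in-order delivery techniques themselves, so the following is only an illustrative sketch of the problem they address, not the paper's mechanism. It assumes a hypothetical cache with four line-interleaved banks of unequal latency and shows that responses can come back out of program order, then restores the order with a generic sequence-tag reorder stage. All names and parameters (NUM_BANKS, LINE_BYTES, bank_of, simulate, the latency values) are assumptions made for the example.

```python
NUM_BANKS = 4     # hypothetical number of cache banks
LINE_BYTES = 32   # hypothetical cache-line size in bytes


def bank_of(addr):
    """Map an address to a bank by interleaving consecutive cache lines."""
    return (addr // LINE_BYTES) % NUM_BANKS


def simulate(addresses, bank_latency):
    """Issue one request per cycle and return (arrival_order, delivery_order).

    Request number seq is issued at cycle seq. Each bank serves its own
    requests one at a time, taking bank_latency[bank] cycles per request.
    """
    bank_free_at = [0] * NUM_BANKS
    completions = []  # (finish_cycle, seq) for every request
    for seq, addr in enumerate(addresses):
        bank = bank_of(addr)
        start = max(seq, bank_free_at[bank])
        finish = start + bank_latency[bank]
        bank_free_at[bank] = finish
        completions.append((finish, seq))

    # Order in which responses leave the banks (ties broken by sequence tag).
    arrival_order = [seq for _, seq in sorted(completions)]

    # Generic reorder stage (an assumption, not the paper's design): hold a
    # response until every older response has already been delivered.
    pending = set()
    next_seq = 0
    delivery_order = []
    for seq in arrival_order:
        pending.add(seq)
        while next_seq in pending:
            pending.remove(next_seq)
            delivery_order.append(next_seq)
            next_seq += 1
    return arrival_order, delivery_order


if __name__ == "__main__":
    # Five consecutive lines; bank 0 is assumed to be slow (e.g. a miss).
    arrival, delivered = simulate([0, 32, 64, 96, 128], [5, 1, 1, 1])
    print("arrival order: ", arrival)
    print("delivery order:", delivered)
```

In this toy run the slow bank delays request 0, so requests 1 to 3 return first and must be held until request 0 arrives. The storage needed for such held responses is the kind of cost that a low-overhead interface would aim to keep small.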



Published In

CASES '14: Proceedings of the 2014 International Conference on Compilers, Architecture and Synthesis for Embedded Systems
October 2014
241 pages
ISBN: 9781450330503
DOI: 10.1145/2656106
Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ESWEEK'14: Tenth Embedded System Week
October 12 - 17, 2014
New Delhi, India

Acceptance Rates

Overall Acceptance Rate 52 of 230 submissions, 23%

Cited By

  • (2021) Implementation of Special Load and Store Instruction for the RST Unit. 2021 8th International Conference on Signal Processing and Integrated Networks (SPIN), pp. 772-777. DOI: 10.1109/SPIN52536.2021.9565995. Online publication date: 26-Aug-2021
  • (2018) CDPM: Context-Directed Pattern Matching Prefetching to Improve Coarse-Grained Reconfigurable Array Performance. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37:6, pp. 1171-1184. DOI: 10.1109/TCAD.2017.2748026. Online publication date: Jun-2018
  • (2017) An Out-of-Order Load-Store Queue for Spatial Computing. ACM Transactions on Embedded Computing Systems, 16:5s, pp. 1-19. DOI: 10.1145/3126525. Online publication date: 27-Sep-2017
  • (2017) An Out-of-Order Load-Store Queue for Spatial Computing. 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), p. 134. DOI: 10.1109/FCCM.2017.26. Online publication date: Apr-2017
  • (2015) An Analysis of Accelerator Coupling in Heterogeneous Architectures. Proceedings of the 52nd Annual Design Automation Conference, pp. 1-6. DOI: 10.1145/2744769.2744794. Online publication date: 7-Jun-2015
