Analyzing the suitability of contemporary 3D-stacked PIM architectures for HPC scientific applications

Authors:

Jeffrey S. Vetter,

Rakshit Joydeep,

Stefano Markidis

CF '19: Proceedings of the 16th ACM International Conference on Computing Frontiers

Pages 256 - 262

https://doi.org/10.1145/3310273.3322831

Published: 30 April 2019

Abstract

Scaling off-chip bandwidth is challenging due to fundamental limitations, such as a fixed pin count and plateauing signaling rates. Recently, vendors have turned to 2.5D and 3D stacking to closely integrate system components. Interestingly, these technologies can integrate a logic layer under multiple memory dies, enabling computing capability inside a memory stack. This trend in stacking is making processing-in-memory (PIM) architectures commercially viable. In this work, we investigate the suitability of offloading kernels in scientific applications onto 3D-stacked PIM architectures. We evaluate several hardware constraints that result from the stacked structure. We perform extensive simulation experiments and in-depth analysis to quantify the impact of application locality in TLBs, data caches, and memory stacks. Our results also identify design optimization areas in software and hardware for HPC scientific applications.

Cited By

DANIEL P, RICARDO C, IGNACIO A, MIGUEL P (2023). HARDWARE ACCELERATION OF DNA READ ALIGNMENT PROGRAMS: CHALLENGES AND OPPORTUNITIES. Fractals 31:07. Online publication date: 8-Aug-2023.
https://doi.org/10.1142/S0218348X23500974

Zhao C, Zhang X, Chamberlain R (2022). Executing Data Integration Effectively and Efficiently Near the Memory. IEEE Design & Test 39:2 (65-73). Online publication date: Apr-2022.
https://doi.org/10.1109/MDAT.2021.3069957

Rai S, Sivasubramaniam A, Kumar A, Rengasamy P, Narayanan V, Akel A, Eilert S, Palesi M, Tumeo A, Goumas G, Almudever C (2021). Design space for scaling-in general purpose computing within the DDR DRAM hierarchy for map-reduce workloads. Proceedings of the 18th ACM International Conference on Computing Frontiers (113-123). Online publication date: 11-May-2021.
https://dl.acm.org/doi/10.1145/3457388.3458661

Index Terms

Analyzing the suitability of contemporary 3D-stacked PIM architectures for HPC scientific applications
1. Hardware
  1. Emerging technologies
    1. Analysis and design of emerging devices and systems
      1. Emerging architectures
    2. Memory and dense storage

Information & Contributors

Information

Published In

CF '19: Proceedings of the 16th ACM International Conference on Computing Frontiers

April 2019

414 pages

ISBN:9781450366854

DOI:10.1145/3310273

General Chairs:
Francesca Palumbo, Università degli Studi di Sassari, IT
Michela Becchi, North Carolina State University

Program Chairs:
Martin Schulz, Technical University of Munich, DE
Kento Sato, RIKEN R-CCS, JP

Copyright © 2019 ACM.

© 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research

Conference

CF '19

Sponsor:

SIGMICRO

CF '19: Computing Frontiers Conference

April 30 - May 2, 2019

Alghero, Italy

Acceptance Rates

Overall Acceptance Rate 273 of 785 submissions, 35%
