DOI: 10.1145/3310273.3322831
research-article

Analyzing the suitability of contemporary 3D-stacked PIM architectures for HPC scientific applications

Published: 30 April 2019

Abstract

Scaling off-chip bandwidth is challenging due to fundamental limitations, such as a fixed pin count and plateauing signaling rates. Recently, vendors have turned to 2.5D and 3D stacking to closely integrate system components. Interestingly, these technologies can integrate a logic layer under multiple memory dies, enabling computing capability inside a memory stack. This trend in stacking is making PIM architectures commercially viable. In this work, we investigate the suitability of offloading kernels in scientific applications onto 3D-stacked PIM architectures. We evaluate several hardware constraints that result from the stacked structure. We perform extensive simulation experiments and in-depth analysis to quantify the impact of application locality in TLBs, data caches, and memory stacks. Our results also identify areas for design optimization in software and hardware for HPC scientific applications.

Cited By

  • (2023) HARDWARE ACCELERATION OF DNA READ ALIGNMENT PROGRAMS: CHALLENGES AND OPPORTUNITIES. Fractals 31:07. DOI: 10.1142/S0218348X23500974. Online publication date: 8-Aug-2023
  • (2022) Executing Data Integration Effectively and Efficiently Near the Memory. IEEE Design & Test 39:2 (65-73). DOI: 10.1109/MDAT.2021.3069957. Online publication date: Apr-2022
  • (2021) Design space for scaling-in general purpose computing within the DDR DRAM hierarchy for map-reduce workloads. Proceedings of the 18th ACM International Conference on Computing Frontiers (113-123). DOI: 10.1145/3457388.3458661. Online publication date: 11-May-2021

      Information

      Published In

      CF '19: Proceedings of the 16th ACM International Conference on Computing Frontiers
      April 2019
      414 pages
      ISBN:9781450366854
      DOI:10.1145/3310273
      © 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. 3D stacked memory
      2. PIM
      3. processing-in-memory

      Qualifiers

      • Research-article

      Funding Sources

      • U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research

      Conference

      CF '19
      Sponsor:
      CF '19: Computing Frontiers Conference
      April 30 - May 2, 2019
      Alghero, Italy

      Acceptance Rates

      Overall Acceptance Rate 273 of 785 submissions, 35%
