DOI: 10.1145/3579990.3580009
research-article
Open access

PIMFlow: Compiler and Runtime Support for CNN Models on Processing-in-Memory DRAM

Published: 22 February 2023

Abstract

Processing-in-Memory (PIM) has evolved over decades into a feasible solution for addressing the worsening performance bottleneck of main memory by placing computational logic in or near memory. Recent proposals from DRAM manufacturers have highlighted hardware constraint-aware designs for PIM-enabled DRAM with specialized MAC logic, providing an order-of-magnitude speedup for memory-intensive operations in deep learning models. Although convolutional neural networks were not initially a main target for PIM acceleration due to their high compute intensity, recent CNN models increasingly adopt computationally lightweight implementations. Motivated by the potential for the software stack to enable CNN models on DRAM-PIM hardware without invasive changes, we propose PIMFlow, end-to-end compiler and runtime support for accelerating CNN models on a PIM-enabled GPU memory. PIMFlow transforms model graphs to create inter-node parallelism across the GPU and PIM, explores possible task- and data-parallel execution scenarios to find the optimal execution time, and provides a code-generating back-end and execution engine for DRAM-PIM. PIMFlow achieves up to 82% end-to-end speedup and reduces energy consumption by 26% on average for CNN model inference.
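
To make the execution-scenario exploration concrete, below is a minimal Python sketch of the kind of per-layer GPU/PIM split search the abstract describes. Everything here is an illustrative assumption: the channel-wise split granularity and the linear latency models are stand-ins for this example only, not PIMFlow's actual cost model, partitioning strategy, or API.

```python
# Illustrative sketch: offload part of a conv layer's output channels to PIM
# while the GPU computes the rest, and keep whichever split minimizes the
# critical path. The latency models are hypothetical stand-ins.

def gpu_latency(channels: int) -> float:
    # Stand-in for a profiled GPU cost: compute-bound, scales with work.
    return 1.0 * channels

def pim_latency(channels: int) -> float:
    # Stand-in for a PIM cost: cheap per-channel work, fixed offload overhead.
    return 0.0 if channels == 0 else 5.0 + 0.4 * channels

def best_split(total_channels: int, steps: int = 16) -> tuple[int, float]:
    """Try GPU/PIM channel splits at `steps` granularities and return
    (channels offloaded to PIM, resulting latency) for the best one."""
    best_pim, best_lat = 0, float("inf")
    for i in range(steps + 1):  # i = 0 is GPU-only, i = steps is PIM-only
        pim_ch = total_channels * i // steps
        gpu_ch = total_channels - pim_ch
        # The two partitions run concurrently, so the layer finishes
        # when the slower device does.
        lat = max(gpu_latency(gpu_ch), pim_latency(pim_ch))
        if lat < best_lat:
            best_pim, best_lat = pim_ch, lat
    return best_pim, best_lat

if __name__ == "__main__":
    pim_ch, lat = best_split(256)
    print(f"offload {pim_ch}/256 channels to PIM -> latency {lat:.1f} "
          f"(GPU-only baseline: {gpu_latency(256):.1f})")
```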

Cited By

  • PyGim: An Efficient Graph Neural Network Library for Real Processing-In-Memory Architectures. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 8(3), 1–36. https://doi.org/10.1145/3700434. Online publication date: 10 December 2024.

Published In

CGO '23: Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization
February 2023
262 pages
ISBN: 9798400701016
DOI: 10.1145/3579990
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 February 2023

Author Tags

  1. CNN models
  2. Processing-in-memory

Qualifiers

  • Research-article

Conference

CGO '23

Acceptance Rates

Overall Acceptance Rate 312 of 1,061 submissions, 29%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)1,082
  • Downloads (Last 6 weeks)75
Reflects downloads up to 13 Dec 2024
