DOI: 10.1145/3579990.3580009
research-article
Open access

PIMFlow: Compiler and Runtime Support for CNN Models on Processing-in-Memory DRAM

Published: 22 February 2023

Abstract

Processing-in-Memory (PIM) has evolved over decades into a feasible solution for addressing the worsening performance bottleneck of main memory by placing computational logic in or near memory. Recent proposals from DRAM manufacturers have highlighted hardware constraint-aware designs for PIM-enabled DRAM with specialized MAC logic, providing an order-of-magnitude speedup for memory-intensive operations in deep learning models. Although convolutional neural networks were not initially a main target for PIM acceleration due to their high compute intensity, recent CNN models increasingly adopt computationally lightweight implementations. Motivated by the potential for the software stack to enable CNN models on DRAM-PIM hardware without invasive changes, we propose PIMFlow, end-to-end compiler and runtime support for accelerating CNN models on a PIM-enabled GPU memory. PIMFlow transforms model graphs to create inter-node parallelism across the GPU and PIM, explores possible task- and data-parallel execution scenarios to find the optimal execution time, and provides a code-generating back-end and execution engine for DRAM-PIM. PIMFlow achieves up to 82% end-to-end speedup and reduces energy consumption by 26% on average for CNN model inference.
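
To make the execution-scenario exploration concrete, below is a minimal Python sketch of the kind of per-layer GPU/PIM split search the abstract describes. Everything here is an illustrative assumption: the channel-wise split granularity and the linear latency models are stand-ins for this example only, not PIMFlow's actual cost model, partitioning strategy, or API.

```python
# Illustrative sketch: offload part of a conv layer's output channels to PIM
# while the GPU computes the rest, and keep whichever split minimizes the
# critical path. The latency models are hypothetical stand-ins.

def gpu_latency(channels: int) -> float:
    # Stand-in for a profiled GPU cost: compute-bound, scales with work.
    return 1.0 * channels

def pim_latency(channels: int) -> float:
    # Stand-in for a PIM cost: cheap per-channel work, fixed offload overhead.
    return 0.0 if channels == 0 else 5.0 + 0.4 * channels

def best_split(total_channels: int, steps: int = 16) -> tuple[int, float]:
    """Try GPU/PIM channel splits at `steps` granularities and return
    (channels offloaded to PIM, resulting latency) for the best one."""
    best_pim, best_lat = 0, float("inf")
    for i in range(steps + 1):  # i = 0 is GPU-only, i = steps is PIM-only
        pim_ch = total_channels * i // steps
        gpu_ch = total_channels - pim_ch
        # The two partitions run concurrently, so the layer finishes
        # when the slower device does.
        lat = max(gpu_latency(gpu_ch), pim_latency(pim_ch))
        if lat < best_lat:
            best_pim, best_lat = pim_ch, lat
    return best_pim, best_lat

if __name__ == "__main__":
    pim_ch, lat = best_split(256)
    print(f"offload {pim_ch}/256 channels to PIM -> latency {lat:.1f} "
          f"(GPU-only baseline: {gpu_latency(256):.1f})")
```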

Cited By

  • PyGim: An Efficient Graph Neural Network Library for Real Processing-In-Memory Architectures. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 8(3), 1–36. https://doi.org/10.1145/3700434. Online publication date: 10 December 2024.

Published In

CGO '23: Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization
February 2023
262 pages
ISBN: 9798400701016
DOI: 10.1145/3579990
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 February 2023

Author Tags

  1. CNN models
  2. Processing-in-memory

Qualifiers

  • Research-article

Conference

CGO '23

Acceptance Rates

Overall Acceptance Rate 312 of 1,061 submissions, 29%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)1,082
  • Downloads (Last 6 weeks)75
Reflects downloads up to 13 Dec 2024
