[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article
Open access

Scalability Limitations of Processing-in-Memory using Real System Evaluations

Published: 21 February 2024 Publication History

Abstract

Processing-in-memory (PIM), where the compute is moved closer to the memory or the data, has been widely explored to accelerate emerging workloads. Recently, different PIM-based systems have been announced by memory vendors to minimize data movement and improve performance as well as energy efficiency. One critical component of PIM is the large amount of compute parallelism provided across many PIM "nodes'' or the compute units near the memory. In this work, we provide an extensive evaluation and analysis of real PIM systems based on UPMEM PIM. We show that while there are benefits of PIM, there are also scalability challenges and limitations as the number of PIM nodes increases. In particular, we show how collective communications that are commonly found in many kernels/workloads can be problematic for PIM systems. To evaluate the impact of collective communication in PIM architectures, we provide an in-depth analysis of two workloads on the UPMEM PIM system that utilize representative common collective communication patterns -- AllReduce and All-to-All communication. Specifically, we evaluate 1) embedding tables that are commonly used in recommendation systems that require AllReduce and 2) the Number Theoretic Transform (NTT) kernel which is a critical component of Fully Homomorphic Encryption (FHE) that requires All-to-All communication. We analyze the performance benefits of these workloads and show how they can be efficiently mapped to the PIM architecture through alternative data partitioning. However, since each PIM compute unit can only access its local memory, when communication is necessary between PIM nodes (or remote data is needed), communication between the compute units must be done through the host CPU, thereby severely hampering application performance. To increase the scalability (or applicability) of PIM to future workloads, we make the case for how future PIM architectures need efficient communication or interconnection networks between the PIM nodes that require both hardware and software support.

References

[1]
B. Acun, M. Murphy, X. Wang, J. Nie, C. Wu, and K. Hazelwood. 2021. Understanding training efficiency of deep learning recommendation models at scale. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE Computer Society, Los Alamitos, CA, USA, 802--814. https://doi.org/10.1109/HPCA51647.2021.00072
[2]
Rashmi Agrawal, Leo de Castro, Guowei Yang, Chiraag Juvekar, Rabia Yazicigil, Anantha Chandrakasan, Vinod Vaikuntanathan, and Ajay Joshi. 2023. FAB: An fpga-based accelerator for bootstrappable fully homomorphic encryption. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 882--895. https://doi.org/10.1109/HPCA56546.2023.10070953
[3]
Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. A scalable processing-in-memory accelerator for parallel graph processing. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA). 105--117. https://doi.org/10.1145/2749469.2750386
[4]
Ahmad Al Badawi, Jack Bates, Flavio Bergamaschi, David Bruce Cousins, Saroja Erabelli, Nicholas Genise, Shai Halevi, Hamish Hunt, Andrey Kim, Yongwoo Lee, Zeyu Liu, Daniele Micciancio, Ian Quah, Yuriy Polyakov, Saraswathy R.V., Kurt Rohloff, Jonathan Saylor, Dmitriy Suponitsky, Matthew Triplett, Vinod Vaikuntanathan, and Vincent Zucca. 2022. OpenFHE: Open-source fully homomorphic encryption library. In Proceedings of the 10th Workshop on Encrypted Computing & Applied Homomorphic Cryptography (Los Angeles, CA, USA) (WAHC'22). Association for Computing Machinery, New York, NY, USA, 53--63. https://doi.org/10.1145/3560827.3563379
[5]
Mohammad Alian, Seung Won Min, Hadi Asgharimoghaddam, Ashutosh Dhar, Dong Kai Wang, Thomas Roewer, Adam McPadden, Oliver O'Halloran, Deming Chen, Jinjun Xiong, Daehoon Kim, Wen-mei Hwu, and Nam Sung Kim. 2018. Application-transparent near-memory processing architecture with memory channel network. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 802--814. https://doi.org/10.1109/MICRO.2018.00070
[6]
Michael J. Anderson, Benny Chen, Stephen Chen, Summer Deng, Jordan Fix, Michael Gschwind, Aravind Kalaiah, Changkyu Kim, Jaewon Lee, Jason Liang, Haixin Liu, Yinghai Lu, Jack Montgomery, Arun Moorthy, Nadathur Satish, Sam Naghshineh, Avinash Nayak, Jongsoo Park, Chris Petersen, Martin Schatz, Narayanan Sundaram, Bangsheng Tang, Peter Tang, Amy Yang, Jiecao Yu, Hector Yuen, Ying Zhang, Aravind Anbudurai, Vandana Balan, Harsha Bojja, Joe Boyd, Matthew Breitbach, Claudio Caldato, Anna Calvo, Garret Catron, Sneh Chandwani, Panos Christeas, Brad Cottel, Brian Coutinho, Arun Dalli, Abhishek Dhanotia, Oniel Duncan, Roman Dzhabarov, Simon Elmir, Chunli Fu, Wenyin Fu, Michael Fulthorp, Adi Gangidi, Nick Gibson, Sean Gordon, Beatriz Padilla Hernandez, Daniel Ho, Yu-Cheng Huang, Olof Johansson, Shishir Juluri, and et al. 2021. First-generation inference accelerator deployment at facebook. CoRR, Vol. abs/2107.04140 (2021). showeprint[arXiv]2107.04140 https://arxiv.org/abs/2107.04140
[7]
Hadi Asghari-Moghaddam, Young Hoon Son, Jung Ho Ahn, and Nam Sung Kim. 2016. Chameleon: Versatile and practical near-dram acceleration architecture for large memory systems. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1--13. https://doi.org/10.1109/MICRO.2016.7783753
[8]
D. H. Bailey. 1989. FFTs in external or hierarchical memory. In Supercomputing '89: Proceedings of the 1989 ACM/IEEE Conference on Supercomputing. 234--242. https://doi.org/10.1145/76263.76288
[9]
Zvika Brakerski, Craig Gentry, and Vinod Vaikuntanathan. 2014. (Leveled) fully homomorphic encryption without bootstrapping. ACM Transactions on Computation Theory (TOCT), Vol. 6, 3 (2014), 1--36.
[10]
Sébastien Cayrols, Jiali Li, George Bosilca, Stanimire Tomov, Alan Ayala, and Jack Dongarra. 2022. Lossy all-to-all exchange for accelerating parallel 3-d ffts on hybrid architectures with gpus. In 2022 IEEE International Conference on Cluster Computing (CLUSTER). 152--160. https://doi.org/10.1109/CLUSTER51413.2022.00029
[11]
Jung Hee Cheon, Andrey Kim, Miran Kim, and Yongsoo Song. 2017. Homomorphic encryption for arithmetic of approximate numbers. In International Conference on the Theory and Application of Cryptology and Information Security. Springer, 409--437.
[12]
Ilaria Chillotti, Nicolas Gama, Mariya Georgieva, and Malika Izabachène. 2020. TFHE: Fast fully homomorphic encryption over the torus. Journal of Cryptology, Vol. 33, 1 (2020), 34--91.
[13]
Jack Choquette, Wishwesh Gandhi, Olivier Giroux, Nick Stam, and Ronny Krashinsky. 2021. NVIDIA A100 tensor core gpu: Performance and innovation. IEEE Micro, Vol. 41, 2 (2021), 29--35.
[14]
Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (Boston, Massachusetts, USA) (RecSys '16). Association for Computing Machinery, New York, NY, USA, 191--198. https://doi.org/10.1145/2959100.2959190
[15]
Richard Crandall and Barry Fagin. 1994. Discrete weighted transforms and large-integer arithmetic. Math. Comput., Vol. 62 (1994), 305--324. https://doi.org/10.2307/2153411
[16]
David Culler, Jaswinder Pal Singh, and Anoop Gupta. 1998. Parallel computer architecture: A hardware/software approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[17]
Abdul Dakkak, Cheng Li, Jinjun Xiong, Isaac Gelado, and Wen-mei Hwu. 2019. Accelerating reduction and scan using tensor core units. In Proceedings of the ACM International Conference on Supercomputing. 46--57.
[18]
William Dally and Brian Towles. 2004. Principles and practices of interconnection network. (01 2004).
[19]
William J. Dally, Stephen W. Keckler, and David B. Kirk. 2021. Evolution of the graphics processing unit (gpu). IEEE Micro, Vol. 41, 6 (2021), 42--51. https://doi.org/10.1109/MM.2021.3113475
[20]
W. J. Dally and B. Towles. 2001. Route packets, not wires: On-chip interconnection networks (DAC '01). ACM, New York, NY, USA, 684--689. https://doi.org/10.1145/378239.379048
[21]
William J. Dally, Yatish Turakhia, and Song Han. 2020. Domain-specific hardware accelerators. Commun. ACM, Vol. 63, 7 (jun 2020), 48--57. https://doi.org/10.1145/3361682
[22]
Fabrice Devaux. 2019. The true processing in memory accelerator. In 2019 IEEE Hot Chips 31 Symposium (HCS), Cupertino, CA, USA, August 18--20, 2019. IEEE, 1--24. https://doi.org/10.1109/HOTCHIPS.2019.8875680
[23]
Assaf Eisenman, Maxim Naumov, Darryl Gardner, Misha Smelyanskiy, Sergey Pupyrev, Kim Hazelwood, Asaf Cidon, and Sachin Katti. 2018. Bandana: Using non-volatile memory for storing deep learning models. arxiv: 1811.05922 [cs.LG]
[24]
Junfeng Fan and Frederik Vercauteren. 2012. Somewhat practical fully homomorphic encryption. IACR Cryptol. ePrint Arch., Vol. 2012 (2012), 144.
[25]
Amin Farmahini-Farahani, Jung Ho Ahn, Katherine Morrow, and Nam Sung Kim. 2015. NDA: Near-dram acceleration architecture leveraging commodity dram devices and standard memory modules. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). 283--295. https://doi.org/10.1109/HPCA.2015.7056040
[26]
Mingyu Gao, Grant Ayers, and Christos Kozyrakis. 2015. Practical near-data processing for in-memory analytics frameworks. In 2015 International Conference on Parallel Architecture and Compilation (PACT). 113--124. https://doi.org/10.1109/PACT.2015.22
[27]
Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. 2017. TETRIS: Scalable and efficient neural network acceleration with 3d memory. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (Xi'an, China) (ASPLOS '17). Association for Computing Machinery, New York, NY, USA, 751--764. https://doi.org/10.1145/3037697.3037702
[28]
S. Ghose, A. Boroumand, J. S. Kim, J. Gómez-Luna, and O. Mutlu. 2019. Processing-in-memory: A workload-driven perspective. IBM Journal of Research and Development, Vol. 63, 6 (2019), 3:1--3:19. https://doi.org/10.1147/JRD.2019.2934048
[29]
Christina Giannoula, Ivan Fernandez, Juan Gómez Luna, Nectarios Koziris, Georgios Goumas, and Onur Mutlu. 2022. SparseP: Towards efficient sparse matrix vector multiplication on real processing-in-memory architectures. Proc. ACM Meas. Anal. Comput. Syst., Vol. 6, 1, Article 21 (feb 2022), bibinfonumpages49 pages. https://doi.org/10.1145/3508041
[30]
Steven Gottlieb, W. Liu, D. Toussaint, R. L. Renken, and R. L. Sugar. 1987. Hybrid-molecular-dynamics algorithms for the numerical simulation of quantum chromodynamics. Phys. Rev. D, Vol. 35 (Apr 1987), 2531--2542. Issue 8. https://doi.org/10.1103/PhysRevD.35.2531
[31]
Hui Guan, Andrey Malevich, Jiyan Yang, Jongsoo Park, and Hector Yuen. 2019. Post-training 4-bit quantization on embedding tables. CoRR, Vol. abs/1911.02079 (2019). showeprint[arXiv]1911.02079 http://arxiv.org/abs/1911.02079
[32]
Saransh Gupta, Rosario Cammarota, and Tajana Rosing. 2022. MemFHE: End-to-end computing with fully homomorphic encryption in memory. arXiv preprint arXiv:2204.12557 (2022).
[33]
Saransh Gupta and Tajana vS imunić Rosing. 2021. Accelerating fully homomorphic encryption with processing in memory. In 2021 58th ACM/IEEE Design Automation Conference (DAC). IEEE, 1335--1338.
[34]
Udit Gupta, Carole-Jean Wu, Xiaodong Wang, Maxim Naumov, Brandon Reagen, David Brooks, Bradford Cottel, Kim Hazelwood, Mark Hempstead, Bill Jia, Hsien-Hsin S. Lee, Andrey Malevich, Dheevatsa Mudigere, Mikhail Smelyanskiy, Liang Xiong, and Xuan Zhang. 2020. The architectural implications of facebook's dnn-based personalized recommendation. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). 488--501. https://doi.org/10.1109/HPCA47549.2020.00047
[35]
Juan Gómez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, and Onur Mutlu. 2022. Benchmarking a new paradigm: experimental analysis and characterization of a real processing-in-memory system. IEEE Access, Vol. 10 (2022), 52565--52608. https://doi.org/10.1109/ACCESS.2022.3174101
[36]
Kyoohyung Han and Dohyeong Ki. 2020. Better bootstrapping for approximate homomorphic encryption. In Cryptographers' Track at the RSA Conference. Springer, 364--390.
[37]
John L. Hennessy and David A. Patterson. 2019. A new golden age for computer architecture. Commun. ACM, Vol. 62, 2 (jan 2019), 48--60. https://doi.org/10.1145/3282307
[38]
Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P Scarpazza. 2018. Dissecting the nvidia volta gpu architecture via microbenchmarking. arXiv preprint arXiv:1804.06826 (2018).
[39]
Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-datacenter performance analysis of a tensor processing unit. SIGARCH Comput. Archit. News, Vol. 45, 2 (jun 2017), 1--12. https://doi.org/10.1145/3140659.3080246
[40]
Wonkyung Jung, Sangpyo Kim, Jung Ho Ahn, Jung Hee Cheon, and Younho Lee. 2021. Over 100x faster bootstrapping in fully homomorphic encryption through memory-centric optimization with gpus. IACR Transactions on Cryptographic Hardware and Embedded Systems (2021), 114--148.
[41]
Liu Ke, Udit Gupta, Benjamin Youngjae Cho, David Brooks, Vikas Chandra, Utku Diril, Amin Firoozshahian, Kim Hazelwood, Bill Jia, Hsien-Hsin S. Lee, Meng Li, Bert Maher, Dheevatsa Mudigere, Maxim Naumov, Martin Schatz, Mikhail Smelyanskiy, Xiaodong Wang, Brandon Reagen, Carole-Jean Wu, Mark Hempstead, and Xuan Zhang. 2020. RecNMP: Accelerating personalized recommendation with near-memory processing. In Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture (Virtual Event) (ISCA '20). IEEE Press, 790--803. https://doi.org/10.1109/ISCA45697.2020.00070
[42]
Liu Ke, Xuan Zhang, Jinin So, Jong-Geon Lee, Shin-Haeng Kang, Sukhan Lee, Songyi Han, YeonGon Cho, Jin Hyun Kim, Yongsuk Kwon, KyungSoo Kim, Jin Jung, Ilkwon Yun, Sung Joo Park, Hyunsun Park, Joonho Song, Jeonghyeon Cho, Kyomin Sohn, Nam Sung Kim, and Hsien-Hsin S. Lee. 2022. Near-memory processing in action: Accelerating personalized recommendation with axdimm. IEEE Micro, Vol. 42, 1 (2022), 116--127. https://doi.org/10.1109/MM.2021.3097700
[43]
Gwangsun Kim, John Kim, Jung Ho Ahn, and Jaeha Kim. 2013. Memory-centric system interconnect design with hybrid memory cubes. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques. 145--155. https://doi.org/10.1109/PACT.2013.6618812
[44]
John Kim. 2009. Low-cost router microarchitecture for on-chip networks. In 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 255--266. https://doi.org/10.1145/1669112.1669145
[45]
Jin Hyun Kim, Shin-Haeng Kang, Sukhan Lee, Hyeonsu Kim, Yuhwan Ro, Seungwon Lee, David Wang, Jihyun Choi, Jinin So, YeonGon Cho, JoonHo Song, Jeonghyeon Cho, Kyomin Sohn, and Nam Sung Kim. 2022a. Aquabolt-xl hbm2-pim, lpddr5-pim with in-memory processing, and axdimm with acceleration buffer. IEEE Micro, Vol. 42, 3 (2022), 20--30. https://doi.org/10.1109/MM.2022.3164651
[46]
Jin Hyun Kim, Shin-haeng Kang, Sukhan Lee, Hyeonsu Kim, Woongjae Song, Yuhwan Ro, Seungwon Lee, David Wang, Hyunsung Shin, Bengseng Phuah, Jihyun Choi, Jinin So, YeonGon Cho, JoonHo Song, Jangseok Choi, Jeonghyeon Cho, Kyomin Sohn, Youngsoo Sohn, Kwangil Park, and Nam Sung Kim. 2021. Aquabolt-xl: Samsung hbm2-pim with in-memory processing for ml accelerators and beyond. In 2021 IEEE Hot Chips 33 Symposium (HCS). 1--26. https://doi.org/10.1109/HCS52781.2021.9567191
[47]
Sangpyo Kim, Jongmin Kim, Michael Jaemin Kim, Wonkyung Jung, John Kim, Minsoo Rhu, and Jung Ho Ahn. 2022b. Bts: An accelerator for bootstrappable fully homomorphic encryption. In Proceedings of the 49th Annual International Symposium on Computer Architecture. 711--725.
[48]
Hyoukjun Kwon, Prasanth Chatarasi, Vivek Sarkar, Tushar Krishna, Michael Pellauer, and Angshuman Parashar. 2020. MAESTRO: A data-centric approach to understand reuse, performance, and hardware cost of dnn mappings. IEEE Micro, Vol. 40, 3 (2020), 20--29. https://doi.org/10.1109/MM.2020.2985963
[49]
Yongkee Kwon, Guhyun Kim, Nahsung Kim, Woojae Shin, Jongsoon Won, Hyunha Joo, Haerang Choi, Byeongju An, Gyeongcheol Shin, Dayeon Yun, Jeongbin Kim, Changhyun Kim, Ilkon Kim, Jaehan Park, Chanwook Park, Yosub Song, Byeongsu Yang, Hyeongdeok Lee, Seungyeong Park, Wonjun Lee, Seongju Lee, Kyuyoung Kim, Daehan Kwon, Chunseok Jeong, John Kim, Euicheol Lim, and Junhyun Chun. 2023. Memory-Centric Computing with SK Hynix's Domain-Specific Memory. In 2023 IEEE Hot Chips 35 Symposium (HCS). 1--26. https://doi.org/10.1109/HCS59251.2023.10254717
[50]
Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2019. TensorDIMM: A practical near-memory processing architecture for embeddings and tensor operations in deep learning. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (Columbus, OH, USA) (MICRO '52). Association for Computing Machinery, New York, NY, USA, 740--753. https://doi.org/10.1145/3352460.3358284
[51]
Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2021a. Tensor casting: Co-designing algorithm-architecture for personalized recommendation training. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 235--248. https://doi.org/10.1109/HPCA51647.2021.00029
[52]
Youngeun Kwon and Minsoo Rhu. 2022. Training personalized recommendation systems from (gpu) Scratch: Look forward not backwards. In Proceedings of the 49th Annual International Symposium on Computer Architecture (New York, New York) (ISCA '22). Association for Computing Machinery, New York, NY, USA, 860--873. https://doi.org/10.1145/3470496.3527386
[53]
Young-Cheon Kwon, Jaehoon Lee, Suk Han fhand Lee, Sang-Hyuk Kwon, Je Min Ryu, Jong-Pil Son, O Seongil, Hak-Soo Yu, Haesuk Lee, Soo Young Kim, Youngmin Cho, Jin Guk Kim, Jongyoon Choi, Hyun-Sung Shin, Jin Kim, BengSeng Phuah, HyoungMin Kim, Myeong Jun Song, Ahn Choi, Daeho Kim, SooYoung Kim, Eun-Bong Kim, David Wang, Shinhaeng Kang, Yuhwan Ro, Seungwoo Seo, JoonHo Song, Jaeyoun Youn, Kyomin Sohn, and Nam Sung Kim. 2021b. 25.4 A 20nm 6gb function-in-memory dram, based on hbm2 with a 1.2tflops programmable computing unit using bank-level parallelism, for machine learning applications. In 2021 IEEE International Solid- State Circuits Conference (ISSCC), Vol. 64. 350--352. https://doi.org/10.1109/ISSCC42613.2021.9365862
[54]
Dominique Lavenier, Jean-Francois Roy, and David Furodet. 2016. DNA mapping using processor-in-memory architecture. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 1429--1435.
[55]
Sukhan Lee, Shin-haeng Kang, Jaehoon Lee, Hyeonsu Kim, Eojin Lee, Seungwoo Seo, Hosang Yoon, Seungwon Lee, Kyounghwan Lim, Hyunsung Shin, Jinhyun Kim, O Seongil, Anand Iyer, David Wang, Kyomin Sohn, and Nam Sung Kim. 2021a. Hardware architecture and software stack for pim based on commercial dram technology : Industrial product. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). 43--56. https://doi.org/10.1109/ISCA52012.2021.00013
[56]
Yejin Lee, Seong Hoon Seo, Hyunji Choi, Hyoung Uk Sul, Soosung Kim, Jae W. Lee, and Tae Jun Ham. 2021b. MERCI: Efficient embedding reduction on commodity hardware via sub-query memoization. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (Virtual, USA) (ASPLOS '21). Association for Computing Machinery, New York, NY, USA, 302--313. https://doi.org/10.1145/3445814.3446717
[57]
Vadim Lyubashevsky, Chris Peikert, and Oded Regev. 2010. On ideal lattices and learning with errors over rings. In Advances in Cryptology -- EUROCRYPT 2010, Henri Gilbert (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 1--23.
[58]
Dheevatsa Mudigere, Yuchen Hao, Jianyu Huang, Zhihao Jia, Andrew Tulloch, Srinivas Sridharan, Xing Liu, Mustafa Ozdal, Jade Nie, Jongsoo Park, Liang Luo, Jie (Amy) Yang, Leon Gao, Dmytro Ivchenko, Aarti Basant, Yuxi Hu, Jiyan Yang, Ehsan K. Ardestani, Xiaodong Wang, Rakesh Komuravelli, Ching-Hsiang Chu, Serhat Yilmaz, Huayu Li, Jiyuan Qian, Zhuobo Feng, Yinbin Ma, Junjie Yang, Ellie Wen, Hong Li, Lin Yang, Chonglin Sun, Whitney Zhao, Dimitry Melts, Krishna Dhulipala, KR Kishore, Tyler Graf, Assaf Eisenman, Kiran Kumar Matam, Adi Gangidi, Guoqiang Jerry Chen, Manoj Krishnan, Avinash Nayak, Krishnakumar Nair, Bharath Muthiah, Mahmoud khorashadi, Pallab Bhattacharya, Petr Lapukhov, Maxim Naumov, Ajit Mathews, Lin Qiao, Mikhail Smelyanskiy, Bill Jia, and Vijay Rao. 2022. Software-hardware co-design for fast and scalable training of deep learning recommendation models. In Proceedings of the 49th Annual International Symposium on Computer Architecture (New York, New York) (ISCA '22). Association for Computing Machinery, New York, NY, USA, 993--1011. https://doi.org/10.1145/3470496.3533727
[59]
Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, and Rachata Ausavarungnirun. 2019. Processing data where it makes sense: Enabling in-memory computation. Microprocessors and Microsystems, Vol. 67 (2019), 28--41.
[60]
Lifeng Nai, Ramyad Hadidi, Jaewoong Sim, Hyojong Kim, Pranith Kumar, and Hyesoon Kim. 2017. GraphPIM: Enabling instruction-level pim offloading in graph computing frameworks. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). 457--468. https://doi.org/10.1109/HPCA.2017.54
[61]
Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G. Azzolini, Dmytro Dzhulgakov, Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu, Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr Kondratenko, Stephanie Pereira, Xianjie Chen, Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong, and Misha Smelyanskiy. 2019. Deep learning recommendation model for personalization and recommendation systems. https://doi.org/10.48550/ARXIV.1906.00091
[62]
Hamid Nejatollahi, Saransh Gupta, Mohsen Imani, Tajana Simunic Rosing, Rosario Cammarota, and Nikil Dutt. 2020. Cryptopim: In-memory acceleration for lattice-based cryptographic hardware. In 2020 57th ACM/IEEE Design Automation Conference (DAC). IEEE, 1--6.
[63]
Joel Nider, Craig Mustard, Andrada Zoltan, John Ramsden, Larry Liu, Jacob Grossbard, Mohammad Dashti, Romaric Jodin, Alexandre Ghiti, Jordi Chauzi, et al. 2021. A case study of processing-in-memory in off-the-shelf systems. In USENIX Annual Technical Conference. 117--130.
[64]
NVidia. 2012. cuBLAS library. https://docs.nvidia.com/cuda/cublas/
[65]
Jaehyun Park, Byeongho Kim, Sungmin Yun, Eojin Lee, Minsoo Rhu, and Jung Ho Ahn. 2021. TRiM: Enhancing processor-memory interfaces with scalable tensor reduction in memory. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (Virtual Event, Greece) (MICRO '21). Association for Computing Machinery, New York, NY, USA, 268--281. https://doi.org/10.1145/3466752.3480080
[66]
Thomas Pöppelmann, Tobias Oder, and Tim Güneysu. 2015. High-performance ideal lattice-based cryptography on 8-bit atxmega microcontrollers. In Progress in Cryptology -- LATINCRYPT 2015, Kristin Lauter and Francisco Rodríguez-Henríquez (Eds.). Springer International Publishing, Cham, 346--365.
[67]
Dayane Reis, Jonathan Takeshita, Taeho Jung, Michael Niemier, and Xiaobo Sharon Hu. 2020. Computing-in-memory for performance and energy-efficient homomorphic encryption. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 28, 11 (2020), 2300--2313.
[68]
Joshua Romero, Junqi Yin, Nouamane Laanait, Bing Xie, Todd Young, Sean Treichler, Vitalii Starchenko, Albina Borisevich, Alexander Sergeev, and Michael Matheson. 2022. Accelerating collective communication in data parallel training across deep learning frameworks. (4 2022). https://www.osti.gov/biblio/1862153
[69]
Sujoy Sinha Roy, Frederik Vercauteren, Nele Mentens, Donald Donglong Chen, and Ingrid Verbauwhede. 2014. Compact ring-lwe cryptoprocessor. In International Workshop on Cryptographic Hardware and Embedded Systems. Springer, 371--391.
[70]
Nikola Samardzic, Axel Feldmann, Aleksandar Krastev, Srinivas Devadas, Ronald Dreslinski, Christopher Peikert, and Daniel Sanchez. 2021. F1: A fast and programmable accelerator for fully homomorphic encryption. In MICRO-54: 54th Annu. IEEE/ACM Int. Symp. on Microarchitecture (Virtual Event, Greece) (MICRO '21). ACM, New York, NY, USA, 238--252. https://doi.org/10.1145/3466752.3480070
[71]
UPMEM SAS. 2021a. The pim reference platform. https://www.upmem.com/technology/ Retrieved October 31, 2022 from
[72]
UPMEM SAS. 2021b. UPMEM documentation. https://sdk.upmem.com/2021.3.0/ Retrieved October 31, 2022 from
[73]
Brian C. Schwedock, Piratach Yoovidhya, Jennifer Seibert, and Nathan Beckmann. 2022. T"ak=o: A polymorphic cache hierarchy for general-purpose optimization of data movement. In Proceedings of the 49th Annual International Symposium on Computer Architecture (New York, New York) (ISCA '22). Association for Computing Machinery, New York, NY, USA, 42--58. https://doi.org/10.1145/3470496.3527379
[74]
Harold S. Stone. 1970. A logic-in-memory computer. IEEE Trans. Comput., Vol. C-19, 1 (1970), 73--78. https://doi.org/10.1109/TC.1970.5008902
[75]
Weiyi Sun, Zhaoshi Li, Shouyi Yin, Shaojun Wei, and Leibo Liu. 2021. ABC-DIMM: Alleviating the bottleneck of communication in dimm-based near-memory processing with inter-dimm broadcast. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). 237--250. https://doi.org/10.1109/ISCA52012.2021.00027
[76]
Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. 2017. Efficient processing of deep neural networks: A tutorial and survey. Proc. IEEE, Vol. 105, 12 (2017), 2295--2329. https://doi.org/10.1109/JPROC.2017.2761740
[77]
Rajeev Thakur, Rolf Rabenseifner, and William Gropp. 2005. Optimization of collective communication operations in mpich. Int. J. High Perform. Comput. Appl., Vol. 19, 1 (feb 2005), 49--66. https://doi.org/10.1177/1094342005051521
[78]
Zheng Wang, Yuke Wang, Boyuan Feng, Dheevatsa Mudigere, Bharath Muthiah, and Yufei Ding. 2022. EL-rec: Efficient large-scale recommendation model training via tensor-train embedding table. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis. 1--14. https://doi.org/10.1109/SC41404.2022.00075
[79]
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. https://doi.org/10.48550/ARXIV.1609.08144
[80]
Neng Zhang, Bohan Yang, Chen Chen, Shouyi Yin, Shaojun Wei, and Leibo Liu. 2020. Highly efficient architecture of newhope-nist on fpga using low-complexity ntt/intt. IACR Transactions on Cryptographic Hardware and Embedded Systems (2020), 49--72.
[81]
Huasha Zhao and John Canny. 2013. Sparse allreduce: Efficient scalable communication for power-law data. arxiv: 1312.3020 [cs.DC]
[82]
Huasha Zhao and John Canny. 2014. Kylix: A sparse allreduce for commodity clusters. In 2014 43rd International Conference on Parallel Processing. 273--282. https://doi.org/10.1109/ICPP.2014.36
[83]
Zhe Zhao, Lichan Hong, Li Wei, Jilin Chen, Aniruddh Nath, Shawn Andrews, Aditee Kumthekar, Maheswaran Sathiamoorthy, Xinyang Yi, and Ed Chi. 2019. Recommending what video to watch next: a multitask ranking system. In Proceedings of the 13th ACM Conference on Recommender Systems. 43--51.
[84]
Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1059--1068.
[85]
Z. Zhou, C. Li, F. Yang, and G. Sun. 2023. DIMM-Link: Enabling efficient inter-dimm communication for near-memory processing. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE Computer Society, Los Alamitos, CA, USA, 302--316. https://doi.org/10.1109/HPCA56546.2023.10071005

Cited By

View all
  • (2024)PhD Forum: Efficient Privacy-Preserving Processing via Memory-Centric Computing2024 43rd International Symposium on Reliable Distributed Systems (SRDS)10.1109/SRDS64841.2024.00039(322-325)Online publication date: 30-Sep-2024
  • (2024)PIM-MMU: A Memory Management Unit for Accelerating Data Transfers in Commercial PIM Systems2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO61859.2024.00053(627-642)Online publication date: 2-Nov-2024
  • (2024)SwiftRL: Towards Efficient Reinforcement Learning on Real Processing-In-Memory Systems2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)10.1109/ISPASS61541.2024.00029(217-229)Online publication date: 5-May-2024
  • Show More Cited By

Index Terms

  1. Scalability Limitations of Processing-in-Memory using Real System Evaluations

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image Proceedings of the ACM on Measurement and Analysis of Computing Systems
    Proceedings of the ACM on Measurement and Analysis of Computing Systems  Volume 8, Issue 1
    POMACS
    March 2024
    494 pages
    EISSN:2476-1249
    DOI:10.1145/3649331
    Issue’s Table of Contents
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 21 February 2024
    Published in POMACS Volume 8, Issue 1

    Check for updates

    Author Tags

    1. collective communication
    2. interconnection networks
    3. processing-in-memory

    Qualifiers

    • Research-article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)1,156
    • Downloads (Last 6 weeks)138
    Reflects downloads up to 14 Jan 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)PhD Forum: Efficient Privacy-Preserving Processing via Memory-Centric Computing2024 43rd International Symposium on Reliable Distributed Systems (SRDS)10.1109/SRDS64841.2024.00039(322-325)Online publication date: 30-Sep-2024
    • (2024)PIM-MMU: A Memory Management Unit for Accelerating Data Transfers in Commercial PIM Systems2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO61859.2024.00053(627-642)Online publication date: 2-Nov-2024
    • (2024)SwiftRL: Towards Efficient Reinforcement Learning on Real Processing-In-Memory Systems2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)10.1109/ISPASS61541.2024.00029(217-229)Online publication date: 5-May-2024
    • (2024)NeuraChip: Accelerating GNN Computations with a Hash-based Decoupled Spatial Accelerator2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)10.1109/ISCA59077.2024.00073(946-960)Online publication date: 29-Jun-2024

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Login options

    Full Access

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media