DOI: 10.1145/3410463.3414643
research-article

RackMem: A Tailored Caching Layer for Rack Scale Computing

Published: 30 September 2020

Abstract

High-performance computing (HPC) clusters suffer from low overall memory utilization, caused by node-centric memory allocation combined with the variable memory requirements of HPC workloads. The recent provisioning of nodes with terabytes of memory to accommodate workloads with extreme peak memory requirements further exacerbates the problem. Memory disaggregation is viewed as a promising remedy that increases overall resource utilization and enables cost-effective up-scaling and efficient operation of HPC clusters; however, the overhead of demand paging in virtual memory management has so far hindered performant implementations. To overcome these limitations, this work presents RackMem, an efficient implementation of disaggregated memory for rack-scale computing. RackMem addresses the shortcomings of Linux's demand paging algorithm and automatically adapts to the memory access patterns of individual processes to minimize the inherent overhead of remote memory accesses. Evaluated on a cluster with an InfiniBand interconnect, RackMem outperforms the state-of-the-art RDMA implementation and Linux's virtual memory paging by a significant margin. RackMem's custom demand paging implementation achieves a tail latency that is two orders of magnitude lower than that of the Linux kernel. Compared to the state-of-the-art remote paging solution, RackMem achieves 28% higher throughput and 44% lower tail latency on a wide variety of real-world workloads.
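
To make the caching approach sketched in the abstract concrete, the short user-space C program below models the basic fast-path/slow-path split of software demand paging against remote memory: a hit in a small local page cache costs nothing, while a miss writes a victim page back to the remote pool and fetches the requested page into the freed frame. This is an illustrative sketch only; rmem_fetch, rmem_writeback, access_page, and the round-robin victim selection are hypothetical stand-ins for one-sided RDMA reads/writes and for RackMem's adaptive eviction policy, and they do not reproduce the paper's in-kernel implementation.

/*
 * Illustrative sketch only (not RackMem's kernel implementation):
 * a user-space model of a local DRAM page cache backed by a larger
 * "remote" memory pool.  rmem_fetch/rmem_writeback stand in for
 * one-sided RDMA reads/writes; round-robin victim selection stands
 * in for an adaptive eviction policy.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE    4096
#define LOCAL_FRAMES 4            /* small local cache, forces evictions */
#define REMOTE_PAGES 16           /* size of the simulated remote pool   */

static uint8_t remote_pool[REMOTE_PAGES][PAGE_SIZE]; /* "remote" memory  */
static uint8_t local_cache[LOCAL_FRAMES][PAGE_SIZE]; /* local DRAM cache */
static int     frame_owner[LOCAL_FRAMES];  /* remote page held by each frame; -1 = free */
static int     next_victim;                /* round-robin eviction cursor */

/* Stand-ins for one-sided RDMA READ / WRITE of a single page. */
static void rmem_fetch(int page, uint8_t *dst)
{
    memcpy(dst, remote_pool[page], PAGE_SIZE);
}

static void rmem_writeback(int page, const uint8_t *src)
{
    memcpy(remote_pool[page], src, PAGE_SIZE);
}

/* "Page fault" path: return a local frame holding the requested page,
 * writing back a resident victim page first if the cache is full.     */
static uint8_t *access_page(int page)
{
    for (int f = 0; f < LOCAL_FRAMES; f++)
        if (frame_owner[f] == page)
            return local_cache[f];            /* hit: no remote traffic */

    int victim = next_victim;                 /* miss: pick a victim    */
    next_victim = (next_victim + 1) % LOCAL_FRAMES;
    if (frame_owner[victim] >= 0)
        rmem_writeback(frame_owner[victim], local_cache[victim]);
    rmem_fetch(page, local_cache[victim]);
    frame_owner[victim] = page;
    return local_cache[victim];
}

int main(void)
{
    memset(frame_owner, -1, sizeof(frame_owner));

    /* Touch more pages than fit locally, forcing write-backs ... */
    for (int i = 0; i < REMOTE_PAGES; i++)
        access_page(i)[0] = (uint8_t)i;

    /* ... then verify that evicted data survived the round trip. */
    for (int i = 0; i < REMOTE_PAGES; i++)
        printf("page %2d first byte = %d\n", i, access_page(i)[0]);
    return 0;
}

Compiled with any C99 compiler (e.g., cc -o rackmem_sketch rackmem_sketch.c), the program touches more pages than fit locally and checks that evicted data survives the round trip through the simulated remote pool; the real system replaces the memcpy calls with RDMA transfers and the fixed policy with per-process adaptation.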




Published In

PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques
September 2020
505 pages
ISBN:9781450380751
DOI:10.1145/3410463
  • General Chair: Vivek Sarkar
  • Program Chair: Hyesoon Kim
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 September 2020


Author Tags

  1. high-performance computing
  2. remote memory
  3. resource disaggregation
  4. virtualization

Qualifiers

  • Research-article

Funding Sources

  • National Research Foundation of Korea (NRF)
  • BK21 Plus for Pioneers in Innovative Computing (Dept. of Computer Science and Engineering SNU)

Conference

PACT '20

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Article Metrics

  • Downloads (Last 12 months): 83
  • Downloads (Last 6 weeks): 14
Reflects downloads up to 11 Dec 2024


Cited By

  • (2023) Dynamic Memory Provisioning on Disaggregated HPC Systems. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, 973-982. https://doi.org/10.1145/3624062.3624174. Online publication date: 12-Nov-2023
  • (2023) CXL over Ethernet: A Novel FPGA-based Memory Disaggregation Design in Data Centers. 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 75-82. https://doi.org/10.1109/FCCM57271.2023.00017. Online publication date: May-2023
  • (2023) DEHype: Retrofitting Hypervisors for a Resource-Disaggregated Environment. 2023 IEEE International Conference on Cluster Computing (CLUSTER), 37-48. https://doi.org/10.1109/CLUSTER52292.2023.00011. Online publication date: 31-Oct-2023
  • (2021) RapidSwap: a Hierarchical Far Memory. Economics of Grids, Clouds, Systems, and Services, 143-151. https://doi.org/10.1007/978-3-030-92916-9_12. Online publication date: 9-Dec-2021
