
PipeDevice: a hardware-software co-design approach to intra-host container communication

Published: 30 November 2022
DOI: 10.1145/3555050.3569118

Abstract

Containers are widely adopted because of their deployment and performance advantages over virtual machines. For many containerized data-intensive applications, however, bulk data transfers can pose performance problems. In particular, communication across co-located containers on the same host incurs large overheads from memory copies and the kernel's TCP stack. Existing solutions such as shared-memory networking and RDMA have their own limitations, including insufficient memory isolation and limited scalability.
This paper presents PipeDevice, a new system for low-overhead intra-host container communication. PipeDevice follows a hardware-software co-design approach: it offloads data forwarding entirely onto hardware, which accesses application data in hugepages on the host, thereby eliminating the CPU overheads of memory copying and TCP processing. PipeDevice preserves memory isolation and scales well with the number of connections, making it deployable in public clouds. Isolation is achieved by allocating dedicated memory to each connection from hugepages. To achieve high scalability, PipeDevice stores connection state entirely in host DRAM and manages it in software. Evaluation with a prototype implementation on a commodity FPGA shows that, to deliver 80 Gbps across containers, PipeDevice saves 63.2% CPU compared to the kernel TCP stack and 40.5% compared to FreeFlow. PipeDevice also provides salient benefits to applications: for example, porting baidu-allreduce to PipeDevice yields ~2.2× gains in allreduce throughput.
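The isolation mechanism the abstract describes, a dedicated region of hugepage-backed memory per connection with connection state kept in ordinary host DRAM, can be sketched in a few lines of C. The sketch below is illustrative only and is not PipeDevice's actual API: pd_conn_t, pd_conn_init, and PD_BUF_SIZE are hypothetical names, and it assumes a Linux host with 2 MiB hugepages preallocated (e.g., via /proc/sys/vm/nr_hugepages).

/* Minimal sketch (not PipeDevice's actual API): one dedicated,
 * hugepage-backed buffer per connection, with connection metadata
 * kept in regular host DRAM and managed in software. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define PD_BUF_SIZE (2UL * 1024 * 1024)  /* one 2 MiB hugepage per connection */

typedef struct {
    int   id;   /* connection state lives in host DRAM */
    void *buf;  /* dedicated hugepage region; never shared across connections */
} pd_conn_t;

/* Carve out a private hugepage region for one connection. */
static int pd_conn_init(pd_conn_t *c, int id)
{
    void *p = mmap(NULL, PD_BUF_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");  /* fails if no hugepages are reserved */
        return -1;
    }
    c->id  = id;
    c->buf = p;
    return 0;
}

static void pd_conn_destroy(pd_conn_t *c)
{
    munmap(c->buf, PD_BUF_SIZE);
    c->buf = NULL;
}

int main(void)
{
    pd_conn_t conn;
    if (pd_conn_init(&conn, 1) != 0)
        return EXIT_FAILURE;

    /* The application stages data in its own region; in PipeDevice the
     * hardware engine (an FPGA in the prototype) forwards it, so no
     * CPU-side memcpy between containers is needed. */
    memcpy(conn.buf, "hello", 6);
    printf("connection %d: buffer at %p\n", conn.id, conn.buf);

    pd_conn_destroy(&conn);
    return EXIT_SUCCESS;
}

Because each connection only ever touches its own region, a misbehaving container cannot read or corrupt another connection's buffers, which is the memory-isolation property that plain shared-memory networking gives up.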




    Published In

CoNEXT '22: Proceedings of the 18th International Conference on emerging Networking EXperiments and Technologies
November 2022, 431 pages
ISBN: 9781450395083
DOI: 10.1145/3555050

Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. container communication
    2. hardware-software co-design

    Qualifiers

    • Research-article

    Funding Sources

    • Microsoft
    • Research Grants Council of Hong Kong
    • The Chinese University of Hong Kong

    Conference

    CoNEXT '22

    Acceptance Rates

CoNEXT '22 paper acceptance rate: 28 of 151 submissions (19%)
Overall acceptance rate: 198 of 789 submissions (25%)


    Article Metrics

• Downloads (last 12 months): 151
• Downloads (last 6 weeks): 24
Reflects downloads up to 24 December 2024


    Cited By

• (2024) Switch-Assistant Loss Recovery for RDMA Transport Control. IEEE/ACM Transactions on Networking, 32(3):2069-2084. DOI: 10.1109/TNET.2023.3336661. Online publication date: June 2024.
• (2024) Hyperion: Hardware-Based High-Performance and Secure System for Container Networks. IEEE Transactions on Cloud Computing, 12(3):844-858. DOI: 10.1109/TCC.2024.3403175. Online publication date: July 2024.
• (2024) Un-IOV: Achieving Bare-Metal Level I/O Virtualization Performance for Cloud Usage With Migratability, Scalability and Transparency. IEEE Transactions on Computers, 73(7):1655-1668. DOI: 10.1109/TC.2024.3375589. Online publication date: July 2024.
• (2023) CWASI: A WebAssembly Runtime Shim for Inter-function Communication in the Serverless Edge-Cloud Continuum. In Proceedings of the Eighth ACM/IEEE Symposium on Edge Computing, 158-170. DOI: 10.1145/3583740.3626611. Online publication date: 6 December 2023.
• (2023) X-IO: A High-performance Unified I/O Interface using Lock-free Shared Memory Processing. In 2023 IEEE 9th International Conference on Network Softwarization (NetSoft), 107-115. DOI: 10.1109/NetSoft57336.2023.10175428. Online publication date: 19 June 2023.
