
PipeDevice: a hardware-software co-design approach to intra-host container communication

Published: 30 November 2022
DOI: 10.1145/3555050.3569118

Abstract

Containers are widely adopted because of their deployment and performance advantages over virtual machines. For many containerized data-intensive applications, however, bulk data transfers can pose performance problems. In particular, communication across co-located containers on the same host incurs large overheads from memory copies and the kernel's TCP stack. Existing solutions such as shared-memory networking and RDMA have their own limitations, including insufficient memory isolation and limited scalability.
This paper presents PipeDevice, a new system for low-overhead intra-host container communication. PipeDevice follows a hardware-software co-design approach: it offloads data forwarding entirely onto hardware, which accesses application data in hugepages on the host, thereby eliminating the CPU overheads of memory copying and TCP processing. PipeDevice preserves memory isolation and scales well with the number of connections, making it deployable in public clouds. Isolation is achieved by allocating dedicated memory to each connection from hugepages. To achieve high scalability, PipeDevice stores connection state entirely in host DRAM and manages it in software. Evaluation with a prototype implementation on a commodity FPGA shows that, to deliver 80 Gbps across containers, PipeDevice saves 63.2% CPU compared to the kernel TCP stack and 40.5% compared to FreeFlow. PipeDevice also provides salient benefits to applications: for example, porting baidu-allreduce to PipeDevice yields ~2.2× gains in allreduce throughput.
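The isolation mechanism the abstract describes, a dedicated region of hugepage-backed memory per connection with connection state kept in ordinary host DRAM, can be sketched in a few lines of C. The sketch below is illustrative only and is not PipeDevice's actual API: pd_conn_t, pd_conn_init, and PD_BUF_SIZE are hypothetical names, and it assumes a Linux host with 2 MiB hugepages preallocated (e.g., via /proc/sys/vm/nr_hugepages).

/* Minimal sketch (not PipeDevice's actual API): one dedicated,
 * hugepage-backed buffer per connection, with connection metadata
 * kept in regular host DRAM and managed in software. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define PD_BUF_SIZE (2UL * 1024 * 1024)  /* one 2 MiB hugepage per connection */

typedef struct {
    int   id;   /* connection state lives in host DRAM */
    void *buf;  /* dedicated hugepage region; never shared across connections */
} pd_conn_t;

/* Carve out a private hugepage region for one connection. */
static int pd_conn_init(pd_conn_t *c, int id)
{
    void *p = mmap(NULL, PD_BUF_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");  /* fails if no hugepages are reserved */
        return -1;
    }
    c->id  = id;
    c->buf = p;
    return 0;
}

static void pd_conn_destroy(pd_conn_t *c)
{
    munmap(c->buf, PD_BUF_SIZE);
    c->buf = NULL;
}

int main(void)
{
    pd_conn_t conn;
    if (pd_conn_init(&conn, 1) != 0)
        return EXIT_FAILURE;

    /* The application stages data in its own region; in PipeDevice the
     * hardware engine (an FPGA in the prototype) forwards it, so no
     * CPU-side memcpy between containers is needed. */
    memcpy(conn.buf, "hello", 6);
    printf("connection %d: buffer at %p\n", conn.id, conn.buf);

    pd_conn_destroy(&conn);
    return EXIT_SUCCESS;
}

Because each connection only ever touches its own region, a misbehaving container cannot read or corrupt another connection's buffers, which is the memory-isolation property that plain shared-memory networking gives up.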




    Published In

CoNEXT '22: Proceedings of the 18th International Conference on emerging Networking EXperiments and Technologies
November 2022, 431 pages
ISBN: 9781450395083
DOI: 10.1145/3555050

Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. container communication
    2. hardware-software co-design

    Qualifiers

    • Research-article

    Funding Sources

    • Microsoft
    • Research Grants Council of Hong Kong
    • The Chinese University of Hong Kong

    Conference

    CoNEXT '22

    Acceptance Rates

CoNEXT '22 paper acceptance rate: 28 of 151 submissions (19%)
Overall acceptance rate: 198 of 789 submissions (25%)


    Article Metrics

• Downloads (last 12 months): 151
• Downloads (last 6 weeks): 24
Reflects downloads up to 24 December 2024


    Cited By

• (2024) Switch-Assistant Loss Recovery for RDMA Transport Control. IEEE/ACM Transactions on Networking, 32(3):2069-2084. DOI: 10.1109/TNET.2023.3336661. Online publication date: June 2024.
• (2024) Hyperion: Hardware-Based High-Performance and Secure System for Container Networks. IEEE Transactions on Cloud Computing, 12(3):844-858. DOI: 10.1109/TCC.2024.3403175. Online publication date: July 2024.
• (2024) Un-IOV: Achieving Bare-Metal Level I/O Virtualization Performance for Cloud Usage With Migratability, Scalability and Transparency. IEEE Transactions on Computers, 73(7):1655-1668. DOI: 10.1109/TC.2024.3375589. Online publication date: July 2024.
• (2023) CWASI: A WebAssembly Runtime Shim for Inter-function Communication in the Serverless Edge-Cloud Continuum. In Proceedings of the Eighth ACM/IEEE Symposium on Edge Computing, 158-170. DOI: 10.1145/3583740.3626611. Online publication date: 6 December 2023.
• (2023) X-IO: A High-performance Unified I/O Interface using Lock-free Shared Memory Processing. In 2023 IEEE 9th International Conference on Network Softwarization (NetSoft), 107-115. DOI: 10.1109/NetSoft57336.2023.10175428. Online publication date: 19 June 2023.
