research-article

NBA (network balancing act): a high-performance packet processing framework for heterogeneous processors

Authors:

Sue Moon

EuroSys '15: Proceedings of the Tenth European Conference on Computer Systems

Article No.: 22, Pages 1 - 14

https://doi.org/10.1145/2741948.2741969

Published: 17 April 2015

Abstract

We present the NBA framework, which extends the architecture of the Click modular router to exploit modern hardware, adapts to different hardware configurations, and reaches close to their maximum performance without manual optimization. NBA takes advantage of existing performance-excavating solutions such as batch processing, NUMA-aware memory management, and receive-side scaling with multi-queue network cards. Its abstraction resembles Click but also hides the details of architecture-specific optimization, batch processing that handles the path diversity of individual packets, CPU/GPU load balancing, and complex hardware resource mappings due to multi-core CPUs and multi-queue network cards. We have implemented four sample applications: an IPv4 and an IPv6 router, an IPsec encryption gateway, and an intrusion detection system (IDS) with Aho-Corasick and regular expression matching. The IPv4/IPv6 router performance reaches the line rate on a commodity 80 Gbps machine, and the performance of the IPsec gateway and the IDS reaches above 30 Gbps. We also show that our adaptive CPU/GPU load balancer reaches near-optimal throughput in various combinations of sample applications and traffic conditions.
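
As a rough illustration of what "batch processing that handles the path diversity of individual packets" can involve, the sketch below splits one input batch into per-output sub-batches when packets in the same batch take different branches of an element graph. This is not NBA's API or the authors' mechanism, which the abstract does not describe; `split_batch`, `choose_output`, and the toy packet tuples are hypothetical names chosen for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, TypeVar

P = TypeVar("P")


def split_batch(
    batch: Iterable[P], choose_output: Callable[[P], Hashable]
) -> Dict[Hashable, List[P]]:
    """Group one input batch into per-output sub-batches.

    Packets keep their arrival order within each sub-batch, so each
    downstream element still receives a batch rather than single packets.
    """
    sub_batches: Dict[Hashable, List[P]] = defaultdict(list)
    for pkt in batch:
        sub_batches[choose_output(pkt)].append(pkt)
    return dict(sub_batches)


# Hypothetical packets: (destination address, payload length in bytes).
packets = [
    ("10.0.0.1", 64),
    ("192.168.1.5", 128),
    ("10.0.0.7", 64),
    ("172.16.0.2", 256),
]


def choose_output(pkt):
    # Toy branch decision (an assumption for illustration):
    # 10.x.x.x leaves through output 0, everything else through output 1.
    return 0 if pkt[0].startswith("10.") else 1


result = split_batch(packets, choose_output)
```

Here `result` maps each output index to the sub-batch that should continue along that path. A real framework would also have to decide when small sub-batches are worth regrouping, which this sketch leaves out.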

Supplementary Material

MP4 File (a22-sidebyside.mp4), 850.31 MB

Published In

EuroSys '15: Proceedings of the Tenth European Conference on Computer Systems

April 2015

503 pages

ISBN: 9781450332385

DOI: 10.1145/2741948

General Chair: Laurent Réveillère (LaBRI, University of Bordeaux, France)

Program Chairs: Tim Harris (Oracle Labs, UK), Maurice Herlihy (Brown University)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Ministry of Future Creation and Science

Conference

EuroSys '15

Sponsor: SIGOPS

EuroSys '15: Tenth EuroSys Conference 2015

April 21 - 24, 2015

Bordeaux, France

Acceptance Rates

Overall Acceptance Rate 241 of 1,308 submissions, 18%

Bibliometrics

Article Metrics

Total Citations: 56
Total Downloads: 1,468
Downloads (Last 12 months): 24
Downloads (Last 6 weeks): 7

Reflects downloads up to 15 Jan 2025

Citations

Cited By

Huang Z, Tan Y, Zhu Y, Tan H, Li K (2024). MTDA: Efficient and Fair DPU Offloading Method for Multiple Tenants. IEEE Transactions on Services Computing (1-14). Online publication date: 2024. https://doi.org/10.1109/TSC.2024.3433588

Fingler H, Tarte I, Yu H, Szekely A, Hu B, Akella A, Rossbach C, Aamodt T, Jerger N, Swift M (2023). Towards a Machine Learning-Assisted Kernel with LAKE. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (846-861). Online publication date: 27-Jan-2023. https://dl.acm.org/doi/10.1145/3575693.3575697

Zeng D, Zhu A, Gu L, Li P, Chen Q, Guo M (2023). Enabling Efficient Spatio-Temporal GPU Sharing for Network Function Virtualization. IEEE Transactions on Computers 72:10 (2963-2977). Online publication date: Oct-2023. https://doi.org/10.1109/TC.2023.3278541

Deyannis D, Papadogiannaki E, Chrysos G, Georgopoulos K, Ioannidis S (2022). The Diversification and Enhancement of an IDS Scheme for the Cybersecurity Needs of Modern Supply Chains. Electronics 11:13 (1944). Online publication date: 22-Jun-2022. https://doi.org/10.3390/electronics11131944

Wang J, Lévai T, Li Z, Vieira M, Govindan R, Raghavan B, Gavrilovska A, Altınbüken D, Binnig C (2022). Quadrant. Proceedings of the 13th Symposium on Cloud Computing (493-509). Online publication date: 7-Nov-2022. https://dl.acm.org/doi/10.1145/3542929.3563471

Vasiliadis G, Tsirbas R, Ioannidis S (2022). The Best of Many Worlds: Scheduling Machine Learning Inference on CPU-GPU Integrated Architectures. 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (55-64). Online publication date: May-2022. https://doi.org/10.1109/IPDPSW55747.2022.00017

Papadogiannaki E, Ioannidis S (2021). Acceleration of Intrusion Detection in Encrypted Network Traffic Using Heterogeneous Hardware. Sensors 21:4 (1140). Online publication date: 6-Feb-2021. https://doi.org/10.3390/s21041140

Katsikas G, Barbette T, Kostić D, Maguire J, Steinert R (2021). Metron. ACM Transactions on Computer Systems 38:1-2 (1-45). Online publication date: 8-Jul-2021. https://dl.acm.org/doi/10.1145/3465628

Lévai T, Németh F, Raghavan B, Rétvári G, Bhagwan R, Porter G (2020). Batchy. Proceedings of the 17th Usenix Conference on Networked Systems Design and Implementation (633-650). Online publication date: 25-Feb-2020. https://dl.acm.org/doi/10.5555/3388242.3388289

Dhakal A, Kulkarni S, Ramakrishnan K, Fonseca R, Delimitrou C, Ooi B (2020). GSLICE. Proceedings of the 11th ACM Symposium on Cloud Computing (492-506). Online publication date: 12-Oct-2020. https://dl.acm.org/doi/10.1145/3419111.3421284