
A Survey on Architectures, Hardware Acceleration and Challenges for In-Network Computing

Published: 25 December 2024

Abstract

By moving data and computation away from the end user to more powerful servers in the cloud or to cloudlets at the edge, end user devices only need to compute locally for small amounts of data and when low latency is required. However, with the advent of 6G and Internet-of-Everything, the demand for more powerful networks continues to grow. The introduction of Software-Defined Networking and Network Function Virtualization has allowed us to rethink networks and use them for more than just routing data to servers. In addition, the use of more powerful network devices is bringing new life to the concept of active networks in the form of in-network computing. In-Network Computing provides the ability to move applications into the network and process data on programmable network devices as they are transmitted. In this work, we provide an overview of in-network computing and its enabling technologies. We take a look at the programmability and different hardware architectures for SmartNICs and switches, focusing primarily on accelerators such as FPGAs. We discuss the state of the art and challenges in this area, and look at CGRAs, a class of hardware accelerators that have not been widely discussed in this context.

1 Introduction

With the introduction of 5G and the advancement of Internet of Things (IoT) technology, the demands on network traffic and network processing are rapidly increasing [93]. With the focus now on the development of 6G and the Internet of Everything (IoE), this trend will accelerate even further [73, 91]. In addition to the need for high throughput and low latency, there is a new demand for energy-efficient computing that requires better utilization of the available network resources. However, legacy networks are transmission pipelines, primarily used to move data from typically low-performance data sources to centralized cloud servers where the data are processed. Rather than increasing the computing power of the data source itself, which is especially limited for mobile devices, cloud computing offers the benefits of greater scalability, more available resources, and better cost efficiency.
To address this problem, the concepts of Multi-Access (originally Mobile) Edge Computing [93], fog computing [99], and cloudlets have been introduced. Cloudlets are dedicated and computationally weaker (compared to the cloud) servers placed in close proximity to user devices [89, 96]. However, even though edge computing reduces latency and network load, the problem of insufficient network utilization remains. With the introduction of Software-Defined Networking (SDN) [89] and Network Function Virtualization (NFV) [128], as well as network devices that can do more than data forwarding or simple network-specific transactions such as Network Address Translation (NAT) [56], the research area of In-Network Computing (INC) [77] has emerged. INC moves the computation (of parts) of applications that previously ran on servers into the network. This not only enables more efficient use of available resources, but can also reduce network traffic load, increase throughput, and reduce latency, making INC an important piece of the puzzle on the road to 6G networks. To enable INC, several hardware designs for Smart Network Interface Card (SmartNIC) and programmable switches have been proposed and implemented by the research community as well as by large companies such as Intel, NVIDIA, and AMD/Xilinx. The range of designs explored includes solutions using General Purpose Processors (GPPs), Field Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), or System-on-a-Chip (SoC), as well as lesser-known architectures such as Coarse-Grained Reconfigurable Architectures (CGRAs) [83].
The rest of this work is organized as follows: Section 2 presents related work and our motivation. Sections 3 and 4 give a brief introduction to the enabling technologies and to INC itself. Section 5 discusses the taxonomy presented in this work. Following the taxonomy, the subsequent sections discuss the types of network devices for INC (Section 6), the programmability of these devices (Section 7), and existing architectures (Section 8). Section 9 presents research on proposed hardware designs, including extended Berkeley Packet Filter (eBPF)/eXpress Data Path (XDP) [119] program offloading and optimized designs for specific domains. Section 10 discusses ongoing research challenges, and Section 11 concludes this work.

2 Related Work and Motivation

Comprehensive explorations of SDN and INC for applications and domains such as In-Band Network Telemetry (INT), caching, Machine Learning (ML), congestion control, Consensus Protocols (CPs), security, and resilience/robustness from different perspectives have been done by [67, 75, 77, 89], among others.
Kfoury et al. [75] and Hauser et al. [67] focus on the Programming Protocol-Independent Packet Processor (P4) Programmable Data Plane (PDP). Kfoury et al. [75] compare P4 programmable networks with non-P4 programmable SDNs, while Hauser et al. [67] provide an overview and tutorial on different targets/architectures (ASIC, General Purpose (GP)/Software, Network Processing Unit (NPU), and FPGA), compilers, and APIs.
Michel et al. [89] place a stronger emphasis on network-related applications for SDN. Besides applications, they categorize the PDP with respect to architectures, abstractions, algorithms, and languages, using a categorization of architectures similar to that in [67].
Kianpisheh and Taleb [77] introduced a classification for INC applications, dividing them into (general) PDP INC applications and technology-specific INC applications. The former includes In-Network Analytics (data aggregation, ML, etc.), In-Network Caching, In-Network Security, and In-Network Coordination, while the latter focuses on infrastructure-related applications such as cloud/edge computing, 4G/5G/6G, and NFV.
Other studies address hybrid network architectures that include non-SDN and SDN-capable devices [76], network virtualization [35, 64], the distributed control plane [38, 50, 71], or focus on specific domains such as ML [126, 134].
In our research, similar to [67, 75, 77, 89], we also examine works that focus on offloading applications to the network. In contrast to previous work, however, we propose a different classification for programmable network devices. Rather than categorizing them as SmartNICs, FPGAs, and switches, we emphasize that their architectural characteristics and data processing capabilities are independent of the network device type. For instance, an FPGA can be used to process data on both a Network Interface Card (NIC) and a switch. Our investigation focuses mainly on recent work that has investigated or proposed architectural designs, accelerators, and frameworks for INC, primarily prototyped on FPGAs. We discuss FPGAs as an interesting choice for processing data in the network because they can provide flexibility, acceleration, and power efficiency, in contrast to ASIC-based designs such as the Tofino that lack flexibility, and GPP-based designs that lack power efficiency and acceleration capabilities. But we will also look at the problems of working with FPGAs, how coarse-grained granularity could complement them, and why CGRAs could be a valuable addition for data processing in dynamic network environments.

3 Enabling Technologies

In legacy networks, the decision on how data are forwarded, as well as the functionality of the device, was tightly coupled to the hardware. With the advent of SDN and NFV, hardware-defined functionality and packet forwarding policies have been replaced by a more software-defined approach. SDN and NFV are the two most important technologies for modern networks, enabling programmable and more flexible networks. In this section, we give a brief introduction to these two technologies.

3.1 SDN

Two important questions arise when transmitting network packets. The first is how the packets are transmitted from one device to another, and the second is how the packets are handled by each device (how they move through a device). To abstract the problem, the functionality of network devices can be logically divided into control plane and data plane [89].
The control plane is responsible for handling packet traffic transmitted via or generated by the network device. Its packet processing policies include functions such as updating headers, forwarding packets, and device management, including the allocation and assignment of resources for the data plane. Since the control plane needs to make more complex decisions, it is usually implemented in software running on GPPs [137] rather than on specialized hardware such as an ASIC. However, since software processing on a GPP is quite slow compared to a hardware implementation, the control plane is often referred to as the slow path.
The data plane of a device, on the other hand, is responsible for processing and forwarding packets according to the policies defined by the control plane. To achieve high throughput and low latency, path switching/routing is usually implemented on an ASIC. However, for Virtual Switches (vSwitches), which are used for communication with Virtual Machines (VMs) and thus enable shared use of the physical infrastructure, the data plane can also be implemented in software. Although the data plane of a vSwitch is slower than that of a physical switch, it is still orders of magnitude faster than the control plane of a device.
In legacy networks and network devices, the control and data plane were tightly coupled and therefore lacked flexibility and extensibility. The introduction of SDN decouples the control and data plane and creates a (locally) centralized control plane in the form of a network controller. The network controller enables functions such as global policy management, event-based triggering, centralization of network information, and other global network management and monitoring applications. According to the Open Networking Foundation, the SDN architecture consists of three different layers that are connected to each other via interfaces [45] (Figure 1):
Fig. 1. Overview of the SDN concept. In SDN, the control and data planes are decoupled and communicate over the SBI. Applications can communicate with the control plane through the NBI. Multiple SDN controllers of the distributed control plane can communicate with each other via the EWI.
Application Layer: All SDN applications are located at this layer. The applications communicate with the control plane via the Northbound Interface (NBI).
Control Layer/Control Plane: Logically centralized layer that monitors the network, translates the requirements of the applications, and provides information for the applications via the NBI and controls the network infrastructure via the Southbound Interface (SBI).
Infrastructure Layer/Data Plane: Consists of the Network Elements and devices, responsible for the packet switching and data forwarding. This layer is exposed to the control layer through the SBI.
Because SDN pursues a centralized view of the network in the control plane, the network controller was originally realized on a single machine. While this goal is easily achieved for small networks, for larger networks such as data centers, or for IoT-related networks with a large number of diverse actors, each with different demands, a single centralized network controller quickly becomes a bottleneck. Therefore, several approaches deploy multiple distributed controllers in a network and create the global view by synchronizing them with each other [125]. This physically distributed but logically centralized control plane can be realized in different ways. The simplest approach is to synchronize between all controllers, but it does not scale well. Approaches like Kandoo therefore take a hierarchical approach: all small transactions are handled individually by each local controller, while larger transactions are controlled by a single network controller that has more computational capacity than the local controllers. SDN not only provides the capabilities to address high-bandwidth demands by monitoring and controlling network traffic, but also allows networks to adapt to changing business needs and reduces the complexity of network management and operations [54]. Researchers and network operators can rethink what networks are, as well as how to design and interact with them.

3.2 NFV

In legacy network architectures, any network function beyond simple forwarding, such as firewalls, load balancing, and NAT, was tightly coupled to dedicated middlebox hardware. As a result, any change in network functionality, such as an update to a new protocol, is expensive and time-consuming [128]. NFV virtualizes these Physical Network Functions (PNFs), allowing them to be deployed on commodity servers as software implementations running on VMs and eliminating the need for (expensive) dedicated middlebox hardware. This allows different network services to run on the same equipment, generally reducing network Capital Expenditure and Operational Expense compared to the traditional PNF approach. NFV is complementary to SDN, but the two are not interdependent: SDN deployments can benefit from NFV and vice versa, but both can be deployed independently [1]. What NFV has in common with SDN is that both technologies aim to decouple control and application software from proprietary hardware to enable a more flexible and dynamic network infrastructure.

4 INC

The idea of programmable networks is not new, and was proposed as early as 1997 in the form of active networks [116]. The idea of active networks is to offload the computation of algorithms to the network devices, such as switches, so that they can perform more sophisticated tasks with the data packets than just forwarding them. Although the idea aroused interest, it was not successful at the time due to a lack of computing power and programming models for such network devices.
With the introduction of SDN and the emergence of high-performance programmable network devices such as SmartNICs and programmable switches, we have come a big step closer to realizing the idea of moving computation to the network. Mature programming languages such as P4 [18] and Protocol-Oblivious Forwarding (POF) [107], which is an extension of OpenFlow, also allow heterogeneous network devices to be programmed without having to write code for each device individually. While P4 and POF share the same goal of bringing programmability and reconfigurability to the network, P4 has become the most popular programming language for data plane programming in industry and academia, in part because it provides a High-Level Intermediate Representation [110].
SDN-enabled devices provide us with the ability to perform computation in the network, but there are still many challenges to overcome to make INC feasible and beneficial. For example, Yang et al. [127] showed that it depends on the application whether INC is beneficial or not. In some cases, existing INC solutions can be counterproductive and even degrade performance.
According to Cooke and Fahmy [51], INC is more effective when the edge (data source, e.g., IoT device, sensor with microcontroller) and intermediate nodes (in-network resource, e.g., cloudlet, switch) have capabilities comparable to the central node (e.g., cloud server). In their case study, they found that as long as the computational capabilities of the intermediate nodes are no less than about one-fifth of those of the central node, they still outperform the central node; as the gap becomes larger, the centralized solution becomes preferable. They argue that this gap is unlikely to be closed with traditional software solutions, and that a more sensible approach would be to use hardware accelerators such as FPGAs at the edge and intermediate nodes, bridging the gap to the powerful but inefficient CPUs at the central node while being more cost and energy efficient.
Hu et al. [69] propose a paradigm that moves away from handling data at the packet level and considers all network nodes, including central nodes, as a unified environment based on network hardware virtualization and containers. This would enable a task to be placed on any node and therefore allow a greater degree of flexibility. In their case study for data aggregation, query, and monitoring, they showed that the ingress traffic to the cloud server and the overall energy consumption can be significantly reduced using INC. However, for queries, INC is only feasible when queries are repeated more frequently and the hit rate increases.

5 Taxonomy

Based on the definitions discussed previously, we propose the following taxonomy for programmable and reconfigurable network devices (Figure 2).
Fig. 2. Overview of the proposed taxonomy for programmable and reconfigurable network devices.
Network Device Type. As in the classic taxonomy, we can categorize network devices based on their location and function in the network. However, the focus of this work is on (programmable/reconfigurable) NICs and switches. NICs are located on the host device, for example, a server or a PC. A switch, on the other hand, is located inside the network and connects the devices in the network with each other. This means that while the NIC generally handles the host-specific traffic from and to the host device itself, the switch has to handle a higher traffic load as well as traffic from a larger number of sources and tenants. This is an important fact to consider when offloading functions for INC in order to use the (specialized) compute engines efficiently.
Programming Language/Model. Over time, various programming models, languages, and frameworks for SDN have emerged, such as Frenetic, NPL, P4, Data Plane Development Kit (DPDK), and eBPF. While some frameworks use a General Purpose Language (GPL) such as C, they have limitations in terms of supported constructs; eBPF, for instance, does not support global variables. Domain Specific Languages (DSLs) such as P4 were developed directly for a specific execution model and generally do not provide all the constructs of a GPL like C.
Architecture. The architectures of programmable/reconfigurable network devices can vary greatly in their implementation details (Section 8). While most of the designs discussed in this work are used for NICs, some of these chips can be, and are, also used for programmable switches, such as the AMD/Pensando Capri and Elba architectures [59], which are used both for NICs (so-called Distributed Service Cards (DSCs)) and for switches (Aruba CX 10000 Series [3]). It should be noted that we categorize the architectures based on the data plane. This means, for instance, that an SoC with an FPGA can also have a multi-core Processing System (PS), but the execution of the data plane takes place primarily on the Programmable Logic (PL). We therefore do not categorize such devices as multi-core, but as Reconfigurable SoC (RSoC).

6 Programmable Switches and SmartNICs

According to the SmartNIC Summit [29], a SmartNIC is a NIC that has its own processing capabilities and can perform tasks independently of the host device. It is an accelerator that improves the performance and/or flexibility of tasks such as packet processing, security, and analytics. A SmartNIC by this definition has no specific architectural characteristics regarding which components process the (offloaded) data. For example, it can be a multi-core or many-core system (with cores optimized for network processing), have homogeneous or heterogeneous Computing Units (CUs), be equipped with an accelerator such as an FPGA, or have accelerators implemented as an ASIC. A similar definition can also be applied to programmable/smart switches. While programmable network devices can be used to offload server tasks, they are not capable of running entire cloud services on their own. However, many of today’s cloud services are composed of more or less loosely coupled microservices that can be offloaded to SDN-enabled network devices. Liu et al. [84] investigated the offloading of microservices onto a Marvell/Cavium SmartNIC to improve energy and cost efficiency in data centers. IPSec,1 BM25, Recommend, and NIDS could benefit from the hardware accelerators available on the SmartNIC, while NATv4, Count, EMA, KVS, Flow Monitor, and DDoS could benefit from the fast memory interconnect. Their work shows that microservices can benefit from dedicated hardware accelerators in addition to fast onboard memory interconnects. However, providing hardware accelerators for a wide range of services would result in a large increase in hardware resources on the network devices. The changing requirements of the accelerators, as well as the fact that some may not be used by the microservices, make such a design disadvantageous and could lead to a large increase in the power requirements of the network devices themselves. Therefore, a possible design could provide the most commonly used hardware accelerators as (programmable) ASICs together with embedded reconfigurable hardware accelerators such as FPGAs that can be reconfigured by the data center service provider on demand, or even online, depending on the service requested.

7 Programmability of Network Devices

In this section, we first give a brief overview of the three most common technologies used for programming network devices. The first is P4, a DSL originally established for programming switches that can also be used for a variety of SmartNICs. The second, DPDK, and the third, XDP, are two different approaches that were proposed to improve the performance of packet processing on a host system but can be offloaded to SmartNICs.
P4 is a DSL for programming network devices. In the first version of the standard, P4\({}_{14}\), P4 described the processing of packets in a Reconfigurable Match Table (RMT) (sometimes also called Match-Action Table (MAT)) pipeline with a programmable parser and deparser, targeting Protocol Independent Switch Architecture (PISA) devices [44]. The main idea behind the introduction of the RMT model was to create a Reduced Instruction Set Computer (RISC)-like programmable pipeline without compromising performance through high processing latency [43]. The RMT pipeline therefore uses Ternary Content Addressable Memory for ternary matches and SRAM for exact matches.
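To make the match-action abstraction concrete, the following minimal C sketch emulates what a single exact-match stage does conceptually: a key extracted from the packet header selects a table entry, and the entry determines the action and its parameters. All names and types are illustrative only; a real RMT stage performs this lookup in TCAM/SRAM hardware within a pipeline cycle rather than in software.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch of one exact-match match-action stage. */
typedef enum { ACT_FORWARD, ACT_DROP } action_t;

struct mat_entry {
    uint32_t key;    /* match key, e.g., destination IPv4 address */
    action_t action; /* action selected on a match                */
    uint16_t port;   /* action parameter: egress port             */
};

/* A tiny exact-match table; hardware would hold this in SRAM. */
static const struct mat_entry table[] = {
    { 0x0A000001u, ACT_FORWARD, 1 }, /* 10.0.0.1 -> port 1 */
    { 0x0A000002u, ACT_FORWARD, 2 }, /* 10.0.0.2 -> port 2 */
};

/* Returns the egress port, or -1 for a drop (default action on miss). */
static int match_action(uint32_t dst_ip)
{
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].key == dst_ip)
            return table[i].action == ACT_FORWARD ? table[i].port : -1;
    return -1; /* table miss: apply default action (drop) */
}

int main(void)
{
    printf("10.0.0.1 -> port %d\n", match_action(0x0A000001u)); /* 1  */
    printf("10.0.0.9 -> port %d\n", match_action(0x0A000009u)); /* -1 */
    return 0;
}
```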
The three main goals of P4 are as follows [44]:
Field Reconfigurability: The network controller should be able to redefine the packet parsing and processing in the field, without changing the hardware.
Protocol Independence: The network device should not be restricted to a (set of) specific packet formats. This means that the network controller should be able to redefine how the parser extracts information from the packet header and how to process it.
Target Independence: The same P4 program should run on different target devices, independent of the underlying architecture.
While P4 gives us new ways to interact with the network, it uses a very restrictive model that does not include typical programming constructs such as loops (only parsers are allowed to contain bounded loops). The advantage is that the computational complexity of a P4 program is linear, making it easy to predict how packets will be processed, and the strict processing structure allows the model to guarantee, in principle, that packets can be processed quickly across a variety of network devices. Some limitations can be partially overcome by recirculating packets after they pass through the egress pipeline, or by resubmitting them from the queue (and/or buffers) after the ingress pipeline [16]. Although P4 is already restrictive, vendors also customize P4 for their architectures, which not only makes P4 even more restrictive than the standard already defines but also undermines the third goal of creating a target-independent programming language for the PDP [65]. With the introduction of the 2016 revision of P4, called P4\({}_{16}\) [18], P4 took the next step and went beyond PISA, targeting routers and NICs in addition to switches. With this came the ability to call external objects and functions provided by the architecture. This is an important step for interacting with custom architectures implemented on, for example, an FPGA.
With the introduction of P4\({}_{16}\), the P4 Language Consortium has also defined the Portable Switch Architecture (PSA) and is working on the definition of the Portable NIC Architecture (PNA). The PSA [19] specifies the structure and common capabilities of network switches (illustration of possible packet flows in Figure 3), and the definition of PNA [17] aims to provide the same concept for NICs. In addition, the PNA describes the interaction between the host system and the NIC. However, the definition of PNA is still under development by the P4 Language Consortium (version 0.5 at the time of writing [17]).
Fig. 3. Overview of the packet flow in the P4 Portable Switch Architecture (PSA) for two pipelines. The Packet Buffer (PB), Packet Buffer and Replication Engine (PRE), and the Buffer Queuing Engine (BQE) are target-dependent blocks. Multiple physical ports can be hardwired to one ingress/egress pipeline.
The DPDK [5] is one of the most widely used frameworks for high-performance packet processing. DPDK bypasses the kernel by moving control of networking from kernel space to the application in user space. This eliminates the overhead of kernel/user space context switching and of the kernel network stack. To avoid starting from scratch, DPDK provides many libraries to speed up implementation. To avoid the overhead of interrupts, DPDK provides a Poll Mode Driver (PMD), which is designed to let a core poll for packets continuously. In addition, DPDK provides drivers for Direct Memory Access (DMA), ML, and crypto devices from a variety of vendors such as NXP, NVIDIA, and Marvell. While DPDK provides fast packet processing, both of its main components (kernel bypass and PMD) have drawbacks: first, the kernel bypass gives up security and isolation mechanisms that have already been solved in the well-maintained Linux kernel, and second, the constant active polling of the PMD keeps the CPU core permanently under full load. Originally developed by Intel, DPDK is now under the umbrella of the Linux Foundation and is primarily supported on Linux. While an official Windows port of DPDK exists, it is still under development at the time of this writing [5].
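To illustrate the PMD polling model, the following minimal C sketch shows the core receive loop of a DPDK application. It is a sketch under simplifying assumptions: port 0 is assumed to be already configured and started (the usual rte_eth_dev_configure()/rte_eth_rx_queue_setup()/rte_eth_dev_start() sequence is omitted), and received packets are simply freed instead of being processed or forwarded.

```c
#include <stdint.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

int main(int argc, char **argv)
{
    /* Initialize the Environment Abstraction Layer (cores, hugepages, ...). */
    if (rte_eal_init(argc, argv) < 0)
        return -1;

    const uint16_t port_id = 0; /* assumption: port 0 is configured and started */
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Poll the NIC; returns immediately even when no packets arrived.
         * This busy loop is why a PMD core permanently runs at full load. */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < nb_rx; i++) {
            /* ... packet processing would happen here ... */
            rte_pktmbuf_free(bufs[i]); /* or hand to rte_eth_tx_burst() */
        }
    }
    /* not reached */
}
```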
One of the first projects to provide a P4 compiler with DPDK integration was T4P4S [120]. The T4P4S compiler generates hardware-agnostic C code from a P4 program. T4P4S first compiles the high-level description in P4 into a target-independent low-level description (called Core) and maps this description to the hardware-dependent functions (memory and queue initialization, persistent state/stateful memory management, I/O management and scheduling, packet manipulation, and metadata initialization/modification) using a Hardware Abstraction Layer (HAL) library (called NetHAL). Note that [120] only considers P4\({}_{14}\); the newer version also supports P4\({}_{16}\) [27].
P4-DPDK [15] provides a DPDK integration into the P4 environment that is directly supported by the P4 community and, with the Infrastructure Programmer Development Kit [8], is an integral part of a project under the umbrella of the Linux Foundation. The P4-DPDK Software Switch (SWX) pipeline provides a SWX implementation of the P4 pipeline that uses the specification file created by the P4Compiler (P4C) DPDK backend to set up the (VM-like) software pipeline. The p4-dpdk-target uses the JSON files created by the P4C DPDK backend to define the front-end interfaces and the mapping of the P4 runtime information to the DPDK target-specific information. However, at the time of writing, only the ingress pipeline implementation is supported in the current release.2 Components such as Packet Buffer and Replication Engine (PRE) or Packet Buffer (PB) (Figure 3) and features such as timestamps or checksums are not supported.
While P4-DPDK and T4P4S share the same goal, their approaches differ: T4P4S lowers P4 to a low-level Intermediate Representation (IR) and maps it to the hardware, hiding the implementation details behind a HAL, whereas P4-DPDK configures a software pipeline directly from the artifacts generated by the P4C DPDK backend.
eBPF [6, 103] is a lightweight RISC-like VM inside the Linux kernel. It allows user-defined low-level programs to run in a sandbox within the kernel, extending the kernel’s capabilities without modifying its source code or loading kernel modules. eBPF provides portability, flexibility, and security. A verification step checks that a program cannot intentionally or unintentionally harm the system, for example by causing crashes, accessing out-of-bounds memory, or containing an infinite loop that prevents the program from terminating. A Just-in-Time (JIT) compiler then translates the bytecode into the machine-specific instruction set. eBPF is well supported and used by a large community as well as by industry, for instance by NVIDIA. eBPF is partially supported by DPDK through a BPF library that allows eBPF bytecode to be executed within the user space of a DPDK application [21]. However, at the time of writing, the current version (23.07.0-rc2) does not support features such as external function calls on 32-bit platforms, and JIT is only supported for 64-bit x86 and ARM processors.
XDP [68] was introduced to provide fast packet processing without losing features such as the security mechanisms, management tools, and isolation provided by the OS. Unlike frameworks such as DPDK that bypass the kernel, XDP works at the lowest level of the Linux network stack, exposing a so-called hook to which an eBPF program can be attached. This allows the eBPF program to make early decisions on incoming packets without the overhead of the entire network stack. In contrast to DPDK, XDP is therefore transparent to the host. However, as the network load increases, so does the CPU usage.
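As a minimal example of this hook mechanism, the following eBPF/XDP program, written in the restricted C dialect accepted by the eBPF toolchain, drops all IPv4/UDP packets before they enter the kernel network stack and passes everything else on. The filtering rule is purely illustrative; note the explicit bounds checks, which the verifier described above requires before it accepts the program.

```c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Illustrative XDP program: drop IPv4/UDP, pass everything else. */
SEC("xdp")
int xdp_drop_udp(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    /* Bounds checks are mandatory: the verifier rejects any program
     * that could read past the end of the packet buffer. */
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    return ip->protocol == IPPROTO_UDP ? XDP_DROP : XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```

Such a program is typically compiled with clang -O2 -target bpf and attached to an interface with a loader such as iproute2 or libxdp.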
Because DPDK constantly polls for packets while XDP uses processor interrupts, DPDK generally has a lower latency than XDP. For low-rate packet processing, the difference can be one to two orders of magnitude on average. However, for high-rate packet processing, the advantage of DPDK shrinks to less than a factor of 2.5 for latency.
This means that DPDK is advantageous for services that provide rapid response where low-latency packet processing is required, such as in real-time human–machine interaction applications. XDP, on the other hand, allows the use of the well-established and well-maintained features of the Linux kernel, reducing implementation and maintenance costs.
Like DPDK, P4 provides a compiler backend for eBPF programs [20]. However, at the time of writing, the backend only supports packet filtering. P4 features like multicast, ternary table matches, or parsers containing cycles are not yet supported.
The AMD/Xilinx Nanotube compiler and framework [10] allows eBPF/XDP code to be compiled into a processing pipeline in High-Level Synthesis (HLS) C++ that can be synthesized with Vitis HLS.
For the interested reader, a more detailed overview of eBPF and XDP can be found in [119].

8 Architectures

In this section, we discuss different architectural approaches for SDN devices that have been proposed over the years. While the proposed designs target the same domain, they differ significantly in their design. Some have limited programmability, while others offer GP cores. Many designs that use GP cores are marketed as Data Processing Units (DPUs), and authors such as [39] distinguish between SmartNICs and DPUs. The distinction made by [39] is based on the fact that, unlike previous SmartNICs such as the ConnectX cards or SmartNICs equipped with Netronome’s NPUs, the BlueField DPU is able to run control plane applications in addition to data plane processing. However, the control plane can be logically centralized yet physically distributed, as discussed in Section 3, and it is not specified that the controller must be located on a server. This means that multiple physically distributed single-device control planes conform to the SDN paradigm and can be realized on any capable NIC or switch, which also conforms to the definition presented in Section 6. Therefore, we would not treat DPUs as a separate category of network devices, but rather categorize them as a possible chip design on a SmartNIC (Figure 4 provides an overview).
Fig. 4. An overview of the different types of SmartNICs. Chip (1) is used for data plane offloading, but may also include GPP for control plane tasks. Chip (2) is optional and can be either an accelerator (e.g., GPU on NVIDIA A30X/A100X) or a CPU for control plane tasks. In the latter case, the optional management port (ETH MNGT) is connected directly to (2). On reconfigurable hardware, some components such as the MAC can be hardened as ASIC or soft-implemented on the FPGA fabric. Some many-core designs also have a dedicated cluster connected directly to the MACs. Similar designs can be found for switches. MAC, Media Access Control.
In the following, we will first discuss the use of GPPs (Section 8.1) for programmable network devices and then move on to programmable ASICs (Section 8.2), followed by FPGA (Section 8.3) and SoC (Section 8.4) designs. We conclude this section with a discussion of CGRAs and JIT compilation (Section 8.5), which, to the best of our knowledge, have not yet been discussed in the context of network architectures for INC.

8.1 GPPs

As the GPP architectures used in commodity servers offer a high degree of flexibility when processing data packets, they are widely used in data centers. However, their high flexibility comes at the price of low energy efficiency and latency that is difficult or impossible to predict. On their own, they are generally not able to provide the performance required for inline packet processing. Therefore, they are typically not used as standalone processors for the data plane, but for control plane processing and slow path offloading instead. They are either connected as standalone processors to an accelerator via a (PCIe) interface or integrated as a component/subsystem of an SoC. An example of a standalone processor combined with an accelerator is the Cisco 3550-T. The Cisco 3550-T switch has all of its 48 25G SFP28 ports directly connected to an AMD/Xilinx Virtex UltraScale Plus FPGA with 8 GB of High-Bandwidth Memory (HBM2) for data plane processing. The FPGA is connected via a PCIe Gen3 x8 interface to an Intel Atom processor with 8 cores and 16 GB DDR4 for control plane processing.
While the advantage of such a design is that all components are readily available and no custom chip design is required, the disadvantage is that the data transfer between accelerator and processor is limited by the interface (e.g., about 63 Gb/s of data transfer without protocol overhead for PCIe Gen3 x8) and each transfer incurs a communication overhead that increases latency. Therefore, GPPs (with the exception of slow-path offloading) are not used for data plane processing on these devices. The trend is to use GP cores tightly coupled with dedicated (network) accelerators on a single SoC. Typical processor cores for this type of device are ARM- or MIPS-based. Further details on such designs are discussed in Section 8.4.
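The quoted figure follows directly from the link parameters: PCIe Gen3 signals 8 GT/s per lane and uses 128b/130b encoding, so eight lanes yield \(8\,\textrm{GT/s} \times 8 \times \frac{128}{130} \approx 63\,\textrm{Gb/s}\) of raw bandwidth, from which the transaction-layer protocol overhead must still be subtracted.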

8.2 ASICs

An ASIC is a chip that is specialized and optimized for a specific task. Hardware developers implement only the minimum set of operations required to accomplish the task at hand. ASIC-based designs have the advantage of performing their task with an energy-to-performance ratio that is orders of magnitude better than that of a GPP [83], but they lack any flexibility. Since they also cannot be updated or fixed once they are manufactured and shipped, they have long and costly development cycles to ensure the correctness of the design.
Before the introduction of SDN and NFV, ASIC-based network devices with functionality tightly coupled to hardware were the state of the art, not only for power but also for performance reasons. Earlier GPPs were simply not capable of processing packets at line speed, and in many cases they still are not. As a result, many ASIC-based network chips use GPPs only for control plane functionality or to handle unusual/special cases of data plane packet processing (slow path offloading), while fast packet processing on the data plane is still implemented with ASICs. However, some ASIC-based network chips, such as the Intel Tofino (formerly Barefoot Tofino) [34], do offer some flexibility. These so-called programmable ASICs provide a basic level of flexibility by using MAT pipelines inspired by the RISC architecture. In addition, such designs are often equipped with programmable parsers and deparsers to enable protocol-independent packet processing according to the PISA architecture. This allows network operators to define and use their own protocols, or to switch to a different protocol if required, without having to replace expensive hardware, unlike with fixed-function network chips.

8.3 FPGAs

FPGAs are reconfigurable devices consisting of Configurable Logic Blocks connected by a Switch Matrix. In general, any digital circuit can be designed on an FPGA, as long as enough resources are available. The downside of this flexibility is that FPGAs are larger and the achievable frequency is typically lower compared to ASICs. To at least partially address these issues, FPGAs typically also contain slices of specialized components such as Digital Signal Processing units and Block-RAM (BRAM).
Before P4, Xilinx had already targeted network programmability for their FPGAs. They proposed their own packet processing language called PX [24] for SDNet. SDNet allows the import of custom IP cores implemented in VHDL and Verilog by defining the interface to the user engine [23]. In addition, a C behavioral description must be provided for simulation. However, with the increasing popularity of P4, a \(P4_{16}\) compiler was introduced into the SDNet environment. The Xilinx adaptation of the P4 compiler generates a JSON file for the runtime control software and the PX program, which is compiled as before by the SDNet compiler into a Hardware Description Language (HDL) module [70]. In addition, a SystemVerilog testbench is generated. With the P4 \(\rightarrow\) NetFPGA design flow, the P4-SDNet compiler was integrated by the NetFPGA project [70] to provide easier programmability of their NetFPGA-SUME boards without the need for HDL knowledge. NetFPGA PLUS [7] introduced a codebase using parts of the AMD/Xilinx OpenNIC project [2] to support AMD/Xilinx Alveo 200, 250, and 280 boards. With the introduction of Vitis Networking P4 (VNP4) [53], AMD/Xilinx provides direct P4 integration, generating SystemVerilog files for simulation and synthesis from a P4 description using the P4C-vitisnet compiler.
Since P4 is not a Turing-complete language, many constructs required for FPGA data plane programming cannot be expressed in it. However, since the introduction of extern in \(P4_{16}\), it is possible to interact with objects and functions provided by the architecture. This means it is possible to configure the FPGA and connect functions exposed to the data plane. Nevertheless, many features of modern FPGAs, such as Dynamic Partial Reconfiguration (DPR), cannot yet be used. The still-incomplete P4 standard for SmartNICs (the PNA, Section 7) is another open problem related to portability. The OpenNIC project [2] addresses these problems only partially: first, as the authors state, OpenNIC is not a full-featured SmartNIC solution, and second, the project originated as a research project within Xilinx Labs and targets only Xilinx FPGAs.
Research on providing programmability for the data plane has made significant progress. However, many features are still missing or proprietary. Moreover, programming FPGAs for network processing remains challenging and sometimes still requires a lot of manual work and expert knowledge. For example, while it is possible to use HLS with OpenNIC, the RTL wrapper still has to be created manually because the interfaces generated by HLS are not directly compatible with the interfaces expected by OpenNIC.

8.4 SoC

SoC is a collective term for any design that realizes a system of multiple heterogeneous components interconnected on a single chip to solve specific tasks. Over time, SoC architectures have been proposed and realized for many domains, such as mobile processors with integrated graphics units, integrated tensor units for ML, or NPUs for network processing. In this section, we discuss several SoC solutions used for network processing. Since in our opinion the term SoC is too broad to describe the different architectural approaches, we use the following categorization for SoCs on programmable network devices:
(1)
RSoC. An SoC that combines a PS consisting of GP cores with the PL of reconfigurable hardware such as an FPGA or CGRA. The cores of the PS are primarily used for control plane tasks.
(2)
Multi-Core SoC (MuSoC). An SoC that uses multiple high-speed GP cores as the main component. The cores are used for both data plane and control plane processing.
(3)
Many-Core SoC (MaSoC). An SoC with a large number of programmable but simple dedicated cores for data plane processing to achieve a high degree of parallelism. They may also be equipped with a single or small number of GP cores used primarily for control plane tasks.
All three SoC categories can be equipped with hardened accelerators, e.g., for cryptography.
The industry often uses other terms such as NPU, DPU, and Infrastructure Processing Unit (IPU). According to the SmartNIC Summit definition [29] and similar definitions from NVIDIA [31], a DPU should have the following characteristics:
Industry-standard, high-performance, software-programmable (multi-core) CPU.
High-performance network interfaces capable of parsing, processing, and transferring data at line-rate.
Flexible and programmable acceleration engines for applications such as ML, security, telecommunications, and storage.
Since this definition fits the MuSoC category well, most industry-proposed DPUs can be placed in it. However, other proposals, such as the DPU on the AMD/Pensando DSC, fit better into the MaSoC category in our opinion.
According to the definitions provided by the SmartNIC Summit [29] and Intel, we see a strong overlap between DPU and IPU. The only difference is that the DPU definition emphasizes that industry-standard CPUs are used, while the IPU definition focuses more on the accelerators being hardened and tightly coupled to the dedicated programmable cores. Looking at the devices offered by Intel under the umbrella of this term, we can see that it was mainly defined in the context of the Mount Evans architecture [112]. However, other products offered by Intel under this term do not fit this definition: Intel also uses it for the F2000X-PL (Intel Agilex 7 F) and C5000X-PL (Stratix 10 DX) boards. Both boards are SmartNICs with an Intel Xeon-D connected via an on-board PCIe interface to an FPGA-based SoC with a 64-bit quad-core Arm Cortex-A53, which contradicts the definition of an IPU requiring the accelerators to be hardened. Table 1 shows the classification of some upcoming and already available SoCs based on our categorization.
| Class | Manufacturer | Chip(-Family) | Language/Framework | Architecture |
|---|---|---|---|---|
| RSoC | AMD/Xilinx | Alveo U25N | Vitis (HLS, Verilog, VHDL, P4) | XCU25N SoC with FPGA fabric and 4x Arm Cortex-A53 |
| RSoC | Intel | N6000-PL (N6010/6011) | DPDK, FlexRAN (vRAN only), OPAE, OFS, Intel Quartus Prime Pro Edition | AGF014 SoC with FPGA fabric and 4x Arm Cortex-A53 |
| MuSoC | Microsoft/Fungible | S1 | eBPF, C | 16x MIPS64 R6 cores |
| MuSoC | Marvell/Cavium | OCTEON 10 | P4, eBPF, DPDK | 8–24x Arm Neoverse N2 |
| MuSoC | NVIDIA/Mellanox | BlueField-2 | DOCA (SPDK, DPDK, P4, Netlink) | 8x Arm Cortex-A72 |
| MuSoC | NVIDIA/Mellanox | BlueField-3 | DOCA (SPDK, DPDK, P4, Netlink) | 8x or 16x Arm Cortex-A78 |
| MaSoC | AMD/Pensando | Capri | P4, DPDK | 4x Arm Cortex-A72, 112 MPUs (a) |
| MaSoC | AMD/Pensando | Elba | P4, DPDK | 16x Arm Cortex-A72, 144 MPUs |
| MaSoC | Netronome | NFP-4000 | P4, C, DPDK | Arm11 core \(+\) 4 FPCs; 48 PPCs (in- and egress); 5x 12 FPCs (b) |
| MaSoC | Netronome | NFP-6000 | P4, C, DPDK | Arm11 core \(+\) 4 FPCs; 96 PPCs (in- and egress); 10x 12 FPCs |
| MaSoC | Marvell/Cavium | OCTEON III (CN7890) | P4, eBPF, DPDK | 2x MIPS64 R5 \(+\) 48x cnMIPS64 v3 |
| MaSoC | Microsoft/Fungible | F1 | eBPF, C | 4x MIPS64 (2x SMT) \(+\) 8x 6 MIPS64 (4x SMT) |
Table 1. Examples of SoC Classified Based on the Proposed Categorization
(a) On both Capri and Elba, the Media Access Control (MAC) units are connected to a central packet buffer. The Match Processing Units (MPUs) are organized into four pipelines. The ingress and egress pipelines are connected to the central packet buffer. In addition to the ingress and egress pipelines, a pipeline from the host direction (TxDMA) and a pipeline to the host direction (RxDMA) are provided [60]. Both are conceptually similar to the ingress and egress pipelines, with DMA engines and a scheduler (TxDMA) instead of parser/deparser.
(b) Netronome’s Network Flow Processor (NFP) is based on Intel’s IXP28xx Network Processor (NP) architecture [55]. The main components of the SoC are Micro-Engines (MEs), which are software-programmable multi-threaded 32-bit RISC cores. The MEs are divided into two classes, Flow Processing Cores (FPCs) and Packet Processing Cores (PPCs), grouped into islands [11]. PPC islands exist in both the ingress and egress direction and are directly connected to the MAC units. Packets are processed by one or more PPC groups in a pipeline [11].

8.5 CGRA

To the best of our knowledge, CGRAs in the context of programmable network devices have not yet been realized as a dedicated chip design and are not widely discussed in this context. The only exception is [113], which proposed a CGRA-based prototype on an FPGA for the data plane of programmable switches (details are discussed in Section 9.3). Nevertheless, we want to discuss this architecture, which has gained recognition as an alternative reconfigurable-hardware design to FPGAs in some domains, such as ML.
Research on CGRAs has been ongoing for about 30 years, dating back to the early 1990s [48, 66]. Unlike the fine-grained architecture of FPGAs, CGRAs are, as the name implies, a coarse-grained architecture. The higher granularity allows the construction of optimized Processing Elements (PEs) that require less area and power and can operate at a higher clock frequency. These PEs (also called Reconfigurable Cells or Functional Units (FUs) depending on the literature) are the smallest reconfigurable unit of a CGRA and are interconnected by programmable switches.
The higher degree of granularity allows reconfiguration times in the range of \(ns\) to \({\mu}\textrm{s}\), compared to FPGAs with reconfiguration times in the range of \(ms\) to \(s\). Therefore, temporal computation can be better exploited with CGRAs than with FPGAs [83]. However, the higher degree of granularity is accompanied by a loss of flexibility. In terms of their execution model, both CGRAs and FPGAs are data-flow- and configuration-driven architectures. Because CGRAs are less flexible, they are also classified as Domain Specific Accelerators in [83], while FPGAs provide general flexibility. Even though CGRAs seem to be a good compromise between ASICs, with fixed functionality and little to no flexibility, and FPGAs, with a high degree of flexibility, they are still an immature architecture that has to deal with a variety of problems. Most importantly, there are no high-level compilers that are well optimized for this architecture; the gap between manually optimized and compiler-optimized code is large. Liu et al. [83] suggest that it would be more efficient to use a lower-level programming model that exposes more architectural details to the programmer rather than relying on compiler optimization. However, this approach is not practical for many real-world (commercial) scenarios where time-to-market and low-cost development are more important than achieving the highest possible performance for the architecture. Therefore, we believe it is more important to develop compilers that are able to leverage the architectural features of CGRAs, for example, based on the Multi-Level Intermediate Representation (MLIR) compiler infrastructure [80], and to provide libraries with optimized components, as also discussed in [115].
Another approach that takes advantage of the fast reconfigurability of CGRAs is to use CGRA-like overlays for FPGAs. Jain et al. [72] proposed a coarse-grained overlay for JIT compilation on FPGAs, using the Clang compiler to transform OpenCL kernels into LLVM IR from which the Data Flow Graph (DFG) is extracted. The nodes of the DFG are then mapped to the FUs of the overlay, which is used for FU netlist generation and Place and Route (PnR) on the overlay. Their proposed design improved PnR time by orders of magnitude, from several hundred seconds when targeting the FPGA fabric to a fraction of a second when targeting the overlay on a workstation. Even performing PnR on the ARM Cortex-A9 of the Zynq 7020 SoC took less than a second in most cases.
Zamacola et al. [130, 131] proposed a multi-grain overlay for FPGAs based on the CGRA-ME framework [49]. Their proposal builds on their previous work called IMPRESS [129], which supports multi-grain granularity for Xilinx 7 series FPGAs using DPR. In [130], they extend their work by integrating a modified version of CGRA-ME for their mapping tool. Like [72], they use the Clang compiler to transform the code into an LLVM IR to extract the DFG, which is then mapped onto the overlay. However, while [72] uses only a coarse-grained overlay, [130] supports two levels of granularity: a fine-grained level for configuring the FUs and a medium-grained overlay for assembling (or stitching, as they call it) the FUs.
As discussed in Section 4, intermediate nodes have limited computational capabilities. Therefore, they require hardware accelerators to provide the performance needed to offload applications to the network. While programmable ASICs like the Tofino can provide a high degree of performance, they lack the flexibility to meet the needs of many applications. Building a custom ASIC is time-consuming and costly and can only provide a limited number of accelerators that must be clearly defined in advance, which is not feasible in most real-world scenarios. While highly flexible reconfigurability is one of the main strengths of FPGAs when it comes to creating a custom design without going directly to an ASIC, it comes (in addition to long design times) with the disadvantage of a high reconfiguration time (\(ms\) to \(s\)). CGRAs, on the other hand, have a reconfiguration time that is orders of magnitude lower (\(ns\) to \({\mu}\textrm{s}\)) than that of FPGAs [83].
In order to provide acceleration for a wide range of applications in a dynamic network environment, we believe that fast accelerator exchange will be mandatory for future networks utilizing INC. Therefore, we believe that more research in the direction of implementing CGRAs as an actual integrated architecture for network devices, as well as a coarse-grained or multi-level grained overlay for FPGA-based network devices using technologies such as JITs, could be beneficial for accelerating algorithms in the network.

9 Exploring FPGA-Based Designs for the PDP

In this section, we present research focused on offloading applications to FPGA-based network devices. Section 9.1 presents research on SmartNIC and switch hardware designs independent of the application domain, Section 9.2 covers frameworks and compilers for the automatic offloading of eBPF/XDP programs to SmartNICs, and Section 9.3 concludes with research on application offloading to the network for different domains. Table 2 provides a summary of the work presented, and Table 3 summarizes important results.
Table 2.
ScopeProposalYearTarget PlatformDetails
DNSP4DNS [124]2019Switch (FPGA)—P4 implementation of DNS service.
—Control plane communication via DMA.
—Use of P4 \(\rightarrow\) NetFPGA workflow.
—Comparison with Emu based on reported in [111].
—Provides more features than Emu DNS.
CPVariant of PBFT [105]2020SmartNIC and FPGA—Implementation and evaluation of different configurations for PBFT.
—Study for SmartNIC offloading and implementing of PBFT completely on an FPGA.
Caching/KVSLaKe [117]2018Switch/NIC (FPGA)—HW implementation based on the concept of the Memcached system.
—Multilevel cache architecture using on-chip and on-board memory.
—Comparison with Emu based on reported in [111].
—Slightly lower cache-hit latency compared to Emu.
   \(\quad\circ\) Emu does not have a cache hierarchy.
 Emu [111]2017Switch/NIC (CPU, FPGA)—C# library for Kiwi compiler.
—C# code is transformed to Verilog code by Kiwi.
—Additional comparison with NetFGPA and P4FPGA [121].
 PANIC [81]2020SmartNIC (ASIC/FPGA)—Heterogeneous CUs (accelerator or processor)
   \(\quad\rightarrow\) Support of hardware acceleration and software offloads
—CUs connected over a crossbar to a central hardware packet scheduler using a PIFO.
—Uses Corundum’s [58] NIC driver, DMA engine, MAC, and PHY.
—FPGA prototype and ASIC implementation analysis.
General/Multi-DomainSuperNIC [82]2024SmartNIC (FPGA)—Mapping of task DAG to reconfigurable regions on FPGA.
—Reconfigurable regions/CUs are connected over a crossbar to a central scheduler.
—CU can contain a chain of multiple tasks. Wrapper around each task for bypassing.
—Subgraphs (virtual task chain) extracted from a tenant’s DAG are scheduled to tasks in CU.
—Supports tenant CU sharing, replication, and virtual task chain parallelism.
—Fairness mechanism based on space sharing using DRF approach [62] and time-sharing.
 MTPSA [109]2020Switch (CPU, FPGA)—Proposals of security isolation mechanisms for the PSA architecture.
—Separation of superuser- and user pipelines. Concept of roles taken from OSs.
—Proposal to encapsulate user pipelines in superuser egress pipeline (between parser and MAT).
—Encapsulated packet decapsulated by superuser- and processed by user pipeline.
—User program opaque to superuser and other user (P4) programs.
 Terabit Switch Virtualization [102]2021Switch (ASIC-FPGA)—Proposed single SoC solution with network switching logic as ASIC and (embedded) FPGA logic for adaptable packet processing and switch virtualization.
—Based on their P4VBox reference design [101].
—Analysis of ASIC fabric and evaluation of parallel running network application on the FPGA.
 hXDP [46]2020SmartNIC (ASIC/FPGA)—Offloading of XDP programs to a SmartNIC.
—Custom VLIW processor with accelerators.
—Custom compiler to optimize the eBPF byte code for the offload engine.
—Implemented on FPGA, but fixed design \(\rightarrow\) can be implemented as ASIC.
eBPF/XDP OffloadhXDP \(+\) WE [42]2022SmartNIC (ASIC/FPGA)—Extends hXDP with a MAT pipeline (Warp Engine (WE)) in front of the hXDP offload engine.
—Custom compiler in front of the hXDP compiler:
   \(\quad\rightarrow\) Identifies parts to be offloaded to the MAT pipeline.
   \(\quad\rightarrow\) Rest is compiled and executed by the hXDP part.
—Integration of hXDP (\(+\) Warp Engine) in Corundum and ported to Alveo U50.
—Fixed latency of 28 clock cycles (112 ns @250 MHz) for WE.
 eHDL [97]2023SmartNIC (FPGA)—Generation of a hardware pipeline based on the analysis of the eBPF byte code.
   \(\quad\circ\) Compiler translates eBPF byte code to VHDL code.
—Unlike hXDP, resource utilization depends on the application (Figure 5)
 Taurus [113]2022Switch (ASIC, CGRA)—Proposal for CGRA-based switch architecture for ML.
—Prototype using Tofino switch for MAT pipeline and FPGA for CGRA implementation.
—FPGA is connected over Ethernet with the Tofino switch.
—CGRA used for MapReduce \(\rightarrow\) Can be bypassed for normal PISA flow.
—Training on control and inference on data plane (Taurus).
Machine LearningHomunculus [114]2023Switch (ASIC, CGRA)—Framework for mapping ML models to supported switch targets (Tofino, P4-SDNet and Taurus).
—Python Front-end: Provides functions that can be used to integrate existing libraries such as TensorFlow.
—Middle-end: HyperMapper to optimize the configuration file.
—Back-end: Generates code (Spatial [78] and P4).
NetReduce [85], 2023, Switch (ASIC/FPGA)
—Accelerating aggregation in the network for ML.
—Prototype: External FPGA attached to a commodity switch.
—Discusses the implementation of the proposed design as ASIC.
—Discussion of the limitations of (P4) programmable switches.
Security: Pigasus [132], 2020, SmartNIC (FPGA)
—Single-server IDS/IPS.
—Parser, reassembler, and MSPM on FPGA.
—Regular expression and full match stages on host.
Table 2. Overview of the Proposals Discussed, Including Target Platform and Important Details
DNS, Domain Name Server; MSPM, Multi-String Pattern Matcher; MTPSA, Multi-Tenant Portable Switch Architecture; PBFT, Practical Byzantine Fault Tolerance; PIFO, Push-In First-Out; VLIW, Very Long Instruction Word.
Table 3.
ProposalEvaluation SetupRelevant Results
Emu DNS [111]—Host: Intel Xeon E5-2637v4 @3.5 GHz, 64 GB DDR4-RAM
 \(\quad\circ\) NIC: Intel 82599ES (10 GbE)
—NetFPGA-SUME: Emu
—Traffic Source: OSNT [36]
—Throughput (Host): \(\color{green}\uparrow\) \(\approx 5.2x\)
—Latency (Avg., Host): \(\color{green}\downarrow\) \(\approx 1/66.5x\)
—Latency (\(99\)th-perc., Host): \(\color{green}\downarrow\) \(\approx 1/74.4x\)
P4DNS [124]—Host: Intel Xeon E5-2637v4 @3.5 GHz, 64 GB RAM
 \(\quad\circ\) NIC: Solarflare SFC9220 (10 GbE)
 \(\quad\circ\) Software: NSD [12]
—NetFPGA-SUME: P4DNS
—Traffic Source: OSNT [36]
—Throughput (NSD): \(\color{green}\uparrow\) \(\approx 52x\)
—Throughput (Emu): \(\color{green}\uparrow\) \(\approx 10x\)
—Latency (\(99\)th-perc., NSD): \(\color{green}\downarrow\) \(\approx 1/54.2x\)
—Latency (\(99\)th-perc., Emu): \(\color{red}\uparrow\) \(\approx 1.8x\)
—Latency (\(50\)th-perc., NSD): \(\color{green}\downarrow\) \(\approx 1/36.7x\)
Variant of PBFT [105]—Cluster: 24 machines, each Intel Xeon E-2186G
 \(\quad\circ\) 10 GbE network
—Consensus group size: 15 nodes
—Acceleration of data movement and hashing are more beneficial than crypto only acceleration.
—Optimal goodput depends on fine granular batching.
—Area efficiency (Throughput/Area, Intel):
 \(\quad\circ\) \(\color{green}\uparrow\) \(1.5x\) (Packet Filter) - \(248x\) (KV)
Emu Memcached [111]—Host: Intel Xeon E5-2637v4 @3.5 GHz, 64 GB DDR4
 \(\quad\circ\) NIC: Intel 82599ES (10 GbE)
—NetFPGA-SUME: Emu
—Traffic Source: OSNT [36]
—Throughput (Host): \(\color{green}\uparrow\) \(\approx 2.2x\)
—Latency (Avg., Host): \(\color{green}\downarrow\) \(\approx 1/20.1x\)
—Latency (\(99\)th-perc., Host): \(\color{green}\downarrow\) \(\approx 1/22.7x\)
LaKe [117]—Host: Intel Core i7-4770, 64 GB RAM
 \(\quad\circ\) Software: Linux Memcached
—NetFPGA-SUME: LaKe
—Traffic Source: OSNT [36]
—Throughput (Host): \(\color{green}\uparrow\) \(\approx 13.6x\)
—Throughput (Emu): \(\color{green}\uparrow\) \(\approx 6.8x\)
—Latency (Cache Hit, Host): \(\color{green}\downarrow\) \(\approx 1/205x\)
—Latency (Cache Hit, Emu): \(\color{red}\rightarrow\) \(\approx 1x\)
—Latency (Cache Miss, Host): \(\color{green}\downarrow\) \(\approx 1/42x\)
—Latency (Cache Hit vs. Miss, LaKe): \(\color{red}\uparrow\) \(\approx 4.6x\)
PANIC [81]—Server 1: Dell PowerEdge 640
 \(\quad\circ\) NIC: NVIDIA/Mellanox ConnectX-5 (100 GbE)
 \(\quad\circ\) Data Source: DPDK custom packets
—Server 2: Dell PowerEdge 640
 \(\quad\circ\) ADM-PCIE-9V3 (VU3P-2 FPGA): PANIC
 \(\quad\circ\) Use open-source IPs as CUs: RISC-V @250 MHz, AES-256 @250 MHz, SHA-3 @150 MHz
—ConnectX-5 directly connected with PANIC
—Achieved frequency: 250 MHz
 \(\quad\circ\) Exception PIFO: 125 MHz
—11.27% LUT utilization
 \(\quad\circ\) High utilization by PIFO
—8.94% BRAM utilization
—Achieved throughput: 100 Gbps (linerate)
 \(\quad\circ\) Maximum on-chip bandwidth: 256 Gbps
—Latency: Scheduling, load and CU dependent
 \(\quad\circ\) \(\lt 16\,{\mu}\textrm{s}\) in evaluation
SuperNIC [82]—Evaluation using Verilator Simulation (\(s\)) and Testbed (\(t\)).
—Testbed (KVS use case): Cluster connected over a 100 GbE switch
 \(\quad\circ\) SuperNIC: HiTech Global HTG-9200 (AMD/Xilinx VU9P, 9x100 GbE) (\(A^{*}\))
 \(\quad\circ\) 2 Servers: Dell PowerEdge R740 (Xeon Gold 5128) with NVIDIA/Mellanox ConnectX-4 NIC (100 GbE) (\(B\): Clover [118], \(C\): HERD [74]) and NVIDIA/Mellanox BlueField-Gen1 NIC (100 GbE) (\(D\): HERD [74])
 \(\quad\circ\) AMD/Xilinx ZCU106 Evaluation-Board (10 GbE): Clio [63] (\(E\))
—Re-implementation of PANIC [81] to HTG-9200 as baseline.
—Evaluation: Chains with dummies (\(d\)) and chains consisting of tasks (\(r\)), including: Firewall, KV-Cache, NAT, load balancing, forwarding, and AES.
—Parallelism: \(S1\) \(=\) None, \(S2\) \(=\) DAG Parallel, \(S3\) \(=\) \(S2\) \(+\) Instance Parallel
\({}^{*}\) \(A+E\) here with caching NT (also included in the paper without)
—Achieved frequency for most modules: 250 MHz
—Minimum latency (ingress to egress): \(1.3{\mu}\textrm{s}\)
—Beneficial to have short running NTs together in a single CU.
—Time sharing improves area utilization compared to DRF [62] only.
—Achieved throughput: 100 Gbps (linerate)
—Throughput (\(s/d\), \(S1\) vs. \(S2\)): \(\color{red}\rightarrow\) \(\approx 1x\)
—Throughput (\(s/d\), \(S1/S2\) vs. \(S3\)): \(\color{green}\uparrow\) \(\approx 1x-1.5x\)
—Latency (\(s/d\) and \(s/r\), PANIC): \(\color{green}\downarrow\) \(\approx 1x-0.6x\)
—Throughput (\(s/r\), PANIC): \(\color{red}\rightarrow\) \(\approx 1x\)
—\(A+E\) vs. \(E\): Lat. \(\color{green}\downarrow\) \(\approx 0.6x-0.8x\); Thp. \(\color{green}\uparrow\) \(\approx 1.2x-1.8x\)
—\(A+E\) vs. \(B/C\): Lat. \(\color{green}\downarrow\) \(\approx 0.5x-0.6x\); Thp. \(\color{green}\uparrow\) \(\approx 1.2x-1.8x\)
—\(A+E\) vs. \(D\): Lat. \(\color{green}\downarrow\) \(\approx 0.3x-0.4x\); Thp. \(\color{green}\uparrow\) \(\approx 2.9x-3.8x\)
MTPSA [109]—SW-Target: BMv2 (Functional evaluation in Simulation with Mininet.)
—HW-Target: NetFPGA SUME (L2-Switch)
 \(\quad\circ\) Traffic Source: OSNT [36]
\(\quad\circ\) Packet size: 64B–1,518B (Results only for 64B reported).
—Comparison with P4 \(\rightarrow\) NetFPGA PSA as reference design.
—Comparison with MTPSA without (MTPSA\({}_{0}\)) and up to eight user programs (MTPSA\({}_{x},x\in\{2,3,4,8\}\))
—Achieved maximum throughput of the NetFPGA SUME (40 Gbps).
—Latency (Ref. \(\rightarrow\) MTPSA\({}_{0}\)): \(\color{red}\uparrow\) \(1.7{\mu}\textrm{s}\rightarrow 2.52{\mu}\textrm{s}\)
—Latency (MTPSA \({}_{0}\) \(\rightarrow\) MTPSA \({}_{x}\)): \(\color{red}\uparrow\) \(2.52\mu\textrm{s}\rightarrow 3.23{\mu}\textrm{s}-3.3{\mu}\textrm{s}\)
—Relatively stable latency \(\rightarrow\) (Mainly) user program dependent.
—Ref. \(\rightarrow\) MTPSA \({}_{0}\): Logic \(\color{red}\uparrow\) 10%; Memory \(\color{red}\uparrow\) \(7.65\%\)
—MTPSA \({}_{0}\) \(\rightarrow\) MTPSA \({}_{x}\) (Logic, per program): \(\color{red}\uparrow\) \(5.9\%-7.4\%\)
—MTPSA \({}_{0}\) \(\rightarrow\) MTPSA \({}_{x}\) (Memory, per program): \(\color{red}\uparrow\) \(5.4\%-6.3\%\)
Tb Sw. Virtual. [102]—ASIC logic (65 nm): Synopsys Design Compiler
—FPGA logic: AMD/Xilinx XCVU13P
 \(\quad\circ\) Generated vSwitch instances with P4 \(\rightarrow\) NetFPGA.
 \(\quad\circ\) Use case: L2-Switch (26x), Firewall (17x), Router (14x), INT (14x)
—Frequency (ASIC): 1 GHz; Total Area: \(47.6 \mathrm{mm}^{2}\); Power: \(28.3 \mathrm{W}\)
—Frequency (FPGA): 718.4 MHz
—Throughput (per instance): \(129.61-132.63\) Gbps
—Possible throughput (total): \(\approx 1.43-3.45\) Tbps (Max: \(3.2\) Tbps)
hXDP [46]—Host: Intel Xeon E5-1630 v3
—Netronome NFP-4000 @800 MHz: XDP offload
—NetFPGA-SUME: hXDP @156 MHz
—Evaluation applications (Firewall and Katran) and Microbenchmarks.
—Evaluation Packet Forwarding: 64B–1,518B
—\(\approx\) 18% of logic resource utilization
—Throughput (Applications):
 \(\quad\circ\) Host @2.1 GHz: \(\color{green}\uparrow\) \(\approx 1.08x\)\(\approx 1.55x\)
 \(\quad\circ\) Host @3.7 GHz: \(\color{red}\downarrow\) \(\approx 0.88x\)\(\approx 0.62x\)
—Forwarding Latency (64B–1,518B):
 \(\quad\circ\) Host @3.7 GHz: \(\color{green}\downarrow\) \(\approx 1/8.3x\)\(\approx 1/10.7x\)
 \(\quad\circ\) NFP-4000: \(\color{green}\downarrow\) \(\approx 1/1.1x\)\(\approx 1/3.5x\)
—hXDP has higher throughput for TX- or redirection.
hXDP\(+\) WE [42]—AMD/Xilinx Alveo U50 @250 MHz: hXDP
—AMD/Xilinx Alveo U50 @250 MHz: hXDP \(+\) WE
—Logic (hXDP): \(\color{red}\uparrow\) \(\approx\) 51.4% (\(\approx\) 13.3% total)
—Memory (hXDP): \(\color{red}\uparrow\) \(\approx\) 43.5% (\(\approx\) 11.5% total)\(^{a}\)
—Instruction reduction (for hXDP): \(\color{green}\downarrow\) \(\approx\) 16.3% - 100%.
—Throughput (hXDP only): \(\color{green}\uparrow\) \(\approx 1.2x-3.1x\)
 \(\quad\circ\) Complete offload of Suricata to WE: \(\color{green}\uparrow\) \(\approx 18.2x\)
—Latency (hXDP only): \(\color{red}\uparrow\) \(\approx 1.01x-1.1x\)
 \(\quad\circ\) Exception Katran (\(\color{green}\downarrow\) \(\approx 0.98x\))
eHDL [97]—AMD/Xilinx Alveo U50: eHDL
—AMD/Xilinx Alveo U50: hXDP
—AMD/Xilinx Alveo U50: P4-SDNet
—NVIDIA/Mellanox BlueField-2
—Throughput (SDNet)\(^{b}\): \(\color{red}\rightarrow\) 1x
—Throughput (hXDP): \(\color{green}\uparrow\) \(\approx 27.4x-164.4x\)
—Throughput (BlueField-2, 4 Cores)\(^{c}\): \(\color{green}\uparrow\) \(\approx 11.7x-23.4x\)
—Latency (Avg., hXDP): \(\color{red}\uparrow\) \(\approx 1.03x\)
 \(\quad\circ\) \(\color{green}\downarrow\) \(\approx 0.9x\) (Firewall) \(-\) \(\color{red}\uparrow\) \(\approx 1.2x\) (Router)
—Latency\(^{d}\) (BlueField-2): \(\color{green}\downarrow\) \(\approx 0.1x\)
Taurus [113]—2 Servers, each with an Intel Xeon Gold 6248 @2.5 GHz
 \(\quad\circ\) MoonGen: Traffic generator and traffic sink.
—Taurus Switch Prototype:
 \(\quad\circ\) Wedge 100 BF-32X (Tofino Switch)
 \(\quad\circ\) AMD/Xilinx Alveo U250 (CGRA implementation)
—Case study: Anomaly Detection
 \(\quad\circ\) Inference on the CGRA
 \(\quad\circ\) Baseline inference on the control plane
 \(\quad\circ\) 5 Gbps traffic
 \(\quad\circ\) Sampling rate between 100 Kbps and 100 Mbps
—Estimated area overhead of \(3.8\%\) for ASIC implementation
—Latency:
 \(\quad\circ\) Baseline: 34 ms (100 Kbps) - 512 ms (100 Mbps)
 \(\quad\circ\) Taurus (Avg.): 122 ns
—Detection rate:
 \(\quad\circ\) Baseline: 0% (100 Mbps) - \(\approx\) 2.6% (1 Mbps)
 \(\quad\circ\) Taurus: \(\color{green}\uparrow\) 58.2% (100 Mbps - 100 Kbps)
—F1 score:
 \(\quad\circ\) Baseline: 0% (100 Mbps) - \(\approx\) 4.9% (1 Mbps)
 \(\quad\circ\) Taurus: \(\color{green}\uparrow\) 71.1% (100 Mbps - 100 Kbps)
—Online Training convergence fastest with higher sampling rate (\(10^{-2}\)), higher number of epochs (10) and small batch sizes (64).
Homunculus [114]—2 Servers: Intel Xeon Gold 6248 @2.5 GHz
 \(\quad\circ\) MoonGen: Traffic generator and traffic sink.
—Taurus Switch Prototype:
 \(\quad\circ\) Wedge 100BF-32X (Tofino Switch)
 \(\quad\circ\) AMD/Xilinx Alveo U250 (CGRA implementation)
—Case studies: Anomaly Detection, Traffic Classification, and Botnet Chatter Detection.
 \(\quad\circ\) Ideal F1 score calculated offline in SW.
—Achieved line rate for all three applications.
—Achieved ideal F1 score for all three applications.
NetReduce [85]—Single-GPU: six machines each
 \(\quad\circ\) 2x Intel Xeon E5-2064 10x @2.4 GHz
 \(\quad\circ\) 3x 32GB DDR4
 \(\quad\circ\) NVIDIA Geforce RTX 2080 8 GB
 \(\quad\circ\) NIC: NVIDIA/Mellanox ConnectX-5 (100 GbE)
—Multi-GPU: four machines each
 \(\quad\circ\) 2x Intel Xeon Gold 6154 18x @3.00 GHz
 \(\quad\circ\) 16x 64GB DDR4
 \(\quad\circ\) 8x NVIDIA Tesla V100 SXM2 32 GB
 \(\quad\circ\) NIC: NVIDIA/Mellanox ConnectX-5 (100 GbE)
—Evaluation:
 \(\quad\circ\) Image Classification (ImageNet dataset): AlexNet, VGG16, and ResNet50.
 \(\quad\circ\) NLP: BERT, GPT-2, MNLI, QNLI, QQP, and SQuAD
 \(\quad\circ\) Comparison with RAR, SwitchML, FAR, and TAR.
—Throughput (Single-GPU):
 \(\quad\circ\) ImageNet, RAR: \(\color{green}\uparrow\) \(\approx 1.05x\ -\approx 1.45x\)
 \(\quad\circ\) ImageNet, SwitchML: \(\color{green}\uparrow\) \(\approx 1x\ -\approx 1.25x\)
 \(\quad\circ\) NLP, RAR: \(\color{green}\uparrow\) \(\approx 1.22x\ -\approx 1.43x\)
—Throughput (Multi-GPU):
 \(\quad\circ\) ImageNet, FAR: \(\color{green}\uparrow\) \(\approx 1.15x\ -\approx 1.69x\)
 \(\quad\circ\) ImageNet, TAR: \(\color{green}\uparrow\) \(\approx 1.12x\ -\approx 1.58x\)
—Communication Improvement (Multi-GPU\(^{e}\)):
 \(\quad\circ\) ImageNet, RAR: \(\color{green}\uparrow\) \(\approx 1.16x\ -\approx 1.34x\)
—Accuracy Loss (Single-GPU):
 \(\quad\circ\) ImageNet, RAR: \(\color{red}\downarrow\) \(\approx 0.2\%\ -\approx 1.5\%\)
Pigasus [132]—Intel i7-4790 @3.60 GHz: Snort 3 SW only
—Intel i9-9960X @3.1 GHz: Snort 3 SW
 \(\quad\circ\) Intel Stratix 10 MX: Pigasus
—Number of Snort cores for 100 Gbps:
 \(\quad\circ\) IDS: \(\color{green}\downarrow\) 23–185x less than SW only
 \(\quad\circ\) IPS: \(\color{green}\downarrow\) 23–200x less than SW only
—Latency (SW only): \(\color{green}\downarrow\) \(1/3x\) \(-\) \(1/10x\)
—Estimated Power consumption: 49–166 W
 \(\quad\circ\) Compared to SW only: \(\color{green}\downarrow\) \(1/13x\) \(-\) \(1/59x\)
Table 3. Overview of the Setups and Achieved Results of the Proposals Discussed
a: The first number is the relative increase in utilization compared to hXDP. Total resource utilization in parentheses.
b: No implementation for DNAT with SDNet \(\rightarrow\) Problem with dynamic port selection implementation in P4.
c: Only one and four cores reported.
d: Latency for SDNet not reported.
e: Use of only one NVIDIA Tesla V100 per machine in this scenario.
KVS, Key Value Store; LaKe, Layered Key Value Store; LUT, Lookup Table.

9.1 SmartNIC and Switch Designs for Multi-Domain Applications on FPGAs

Lin et al. [81] present PANIC, a multi-tenancy SmartNIC design implemented on an ADM-PCIE-9V3 board equipped with an AMD/Xilinx Virtex UltraScale+ (XCVU3P-2) FPGA.
The incoming packet is received by a single-stage RMT module that is configured at design time. This means that, unlike the RMT stages of the PISA, it cannot be changed online. The module matches the IP address and port fields and maps them to a descriptor for further processing. The packet is then stored in the PB and the created descriptor is written to the scheduler’s Push-In First-Out (PIFO)-based priority queue [106].
The Central Scheduler supports both pull-based scheduling and push-based scheduling with chaining. It also uses a credit-based mechanism to keep track of the current workload of each CU. Depending on the workload, the scheduler operates in pull mode (high workload) or push mode (low workload). Push and pull mode here follow the same concept established in producer-consumer designs, such as those used in cloud messaging services. In pull mode, CUs actively request packets and must wait until the request is complete before they can start working. In push mode, the scheduler actively pushes packets to the CUs. Using chaining, a CU can push a packet directly to the next CU in the chain after processing if that CU's buffer is not full; otherwise, the packet is written back to the PB.
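The scheduler's mode decision can be summarized in a minimal C sketch (our illustration with hypothetical types and a stubbed push_to_cu primitive; PANIC itself implements this logic in Verilog):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor and CU state; PANIC's actual RTL differs. */
typedef struct {
    uint32_t cu_id;     /* next CU in the service chain              */
    uint32_t pkt_addr;  /* packet location in the packet buffer (PB) */
} descriptor_t;

typedef struct {
    uint32_t credits;   /* free slots in the CU's input buffer */
} cu_state_t;

/* Stand-in for the crossbar transfer into the CU's input buffer. */
static void push_to_cu(uint32_t cu_id, const descriptor_t *d) {
    (void)cu_id;
    (void)d;
}

/* Returns true if the descriptor was pushed (low load); false leaves
 * it queued in the PIFO until the highly loaded CU pulls it itself. */
static bool schedule_descriptor(cu_state_t *cu, const descriptor_t *d) {
    if (cu->credits > 0) {        /* credits left -> push mode */
        cu->credits--;            /* consume one credit        */
        push_to_cu(d->cu_id, d);
        return true;
    }
    return false;                 /* saturated -> pull mode    */
}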
A CU can be a hardware accelerator or a processor core. The authors used two hardware accelerators (AES and SHA3) and a RISC-V softcore for their prototype. Both hardware accelerators are based on IPs provided by OpenCores [13, 14]. For the softcore, the open-source RISC-V core generator VexRiscv [26] was used. All components of PANIC are connected via a crossbar built with the open-source network-on-chip router RTL [40].
All components are implemented in Verilog and can run at a frequency of 250 MHz. The only exception is the PIFO, which runs at 125 MHz; according to the authors, this is due to the PIFO's poor scalability on FPGAs. For the PIFO, they could not use the BRAM of the FPGA and had to rely on Lookup Tables (LUTs) only, which strongly impacts the available area utilization. With a data width of 512 bits, the crossbar can support a throughput of up to 128 Gbps per port (512 bit \(\times\) 250 MHz), exceeding the 100 Gbps of the MAC. To transfer packets to the host memory, PANIC uses the DMA engine provided by Corundum [58]. They also use the Corundum NIC driver on the host side.
SuperNIC, proposed by Lin et al. [82], shares similarities with PANIC [81]. Both investigate task offloading for multi-tenants to CUs on a SmartNIC. In both designs, CUs are interconnected with each other and a credit-based central scheduler over a crossbar, and they support task chaining between CUs. However, there are several key differences.3
First, PANIC focused on prototyping a SmartNIC on an FPGA with a general, fixed architecture, potentially implementable as an ASIC. In contrast, SuperNIC consists of three components: (a) FPGA logic for user-offloaded network computation; (b) ASIC logic for fixed system tasks (packet transmission, scheduling, and reception); and (c) GP cores for executing control plane tasks in software. While the latter two components are common in many SmartNICs, Lin et al. [82] primarily focused on offloading network computation to the FPGA logic, in the form of NTs described by a Directed Acyclic Graph (DAG), taking the reconfigurable nature of FPGAs into consideration.
The second major difference relates to a key concept in SuperNIC's design: the abstraction of NTs into virtual NT chains. While PANIC provides a single CU for each task, leading to limited scalability as the number of tasks grows, SuperNIC maps a chain consisting of multiple connected tasks (physical NT chain) to a single reconfigurable CU, called an NT region. A virtual chain is a subset of tasks from a DAG. However, while a virtual chain can contain tasks available in a CU, other tasks in that CU may not be part of the chain. To avoid frequent reconfiguration or overly fine granularity (which would lead back to PANIC's scalability issues), the authors propose the concept of bypassing or "skipping" tasks in a chain. To support this mechanism, each task in a CU is augmented with a wrapper.
For better performance and utilization, tenants can also share a CU, i.e., if a task in a CU is not needed by one tenant, it can be used by another tenant at the same time. In addition, SuperNIC supports execution of multiple virtual chains belonging to a single DAG (DAG Parallelism) in parallel to reduce execution time and replication of CUs (Instance Parallelism) to increase throughput.
To ensure fairness, they provide the concepts of space sharing and time sharing. For space sharing, they determine the number of instances of an NT chain to start and the amount of onboard memory to allocate to each user, using the DRF approach [62]. For time sharing, the allocated resources from the space sharing step are considered fixed, and a virtual start and end time is assigned to each packet, allowing time sharing of CUs, ingress and egress bandwidth, and buffers.
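The space-sharing step can be illustrated with a minimal C sketch of the DRF idea [62], repeatedly granting one more NT chain instance to the tenant with the lowest dominant share; resource types, capacities, and demands below are illustrative and do not reflect SuperNIC's actual allocator:

#include <stdio.h>

#define NRES 2  /* 0: NT chain instances (area), 1: on-board memory (MB) */
#define NTEN 3  /* tenants */

int main(void) {
    double cap[NRES]  = {16.0, 8192.0};   /* total capacity */
    double used[NRES] = {0.0, 0.0};
    double demand[NTEN][NRES] = {{1, 512}, {2, 256}, {1, 1024}}; /* per grant */
    double alloc[NTEN][NRES]  = {{0.0}};

    for (;;) {
        int best = -1;
        double best_share = 2.0;          /* shares are always <= 1 */
        for (int t = 0; t < NTEN; t++) {
            int fits = 1;                 /* does one more grant fit? */
            for (int r = 0; r < NRES; r++)
                if (used[r] + demand[t][r] > cap[r]) fits = 0;
            if (!fits) continue;
            double dom = 0.0;             /* tenant's dominant share */
            for (int r = 0; r < NRES; r++)
                if (alloc[t][r] / cap[r] > dom) dom = alloc[t][r] / cap[r];
            if (dom < best_share) { best_share = dom; best = t; }
        }
        if (best < 0) break;              /* no demand fits anymore */
        for (int r = 0; r < NRES; r++) {
            alloc[best][r] += demand[best][r];
            used[r] += demand[best][r];
        }
    }
    for (int t = 0; t < NTEN; t++)
        printf("tenant %d: %.0f instances, %.0f MB\n", t, alloc[t][0], alloc[t][1]);
    return 0;
}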
Unlike NICs, switches are connected to a much larger number of network nodes, making them ideal targets for in-network application acceleration. However, adding compute logic can impact not only individual nodes but also overall network performance for traffic routed through the switch. Additionally, switches typically manage a larger number of tenants compared to NICs.
Stoyanov and Zilberman proposed the Multi-Tenant Portable Switch Architecture (MTPSA) [109], an extension of the P4 PSA with OS-like roles and privileges for security isolation. The pipelines generally follow the PSA definition. However, the ingress and egress pipelines belong to the superuser (typically the network operator), and the egress pipeline is extended by an encapsulated user (sub-)pipeline between the parser and MAT stages of the superuser egress pipeline. The superuser header vector and metadata bypass the user pipeline. Encapsulated user packets (e.g., using VXLAN) are decapsulated by the superuser pipeline and processed by the user pipeline. This keeps user programs opaque: they run without interference from, or visibility to, the superuser pipeline and other user (P4) programs. The authors extended P4C with a back-end compiler for the BMv2 SW-target to generate superuser and user pipelines separately. Due to SDNet limitations, they adapted the design for NetFPGA, placing the user pipeline before the superuser egress pipeline.
Saquetti et al. [102] proposed an SoC consisting of ASIC and FPGA logic components based on their prior work P4VBox [101] for data plane virtualization. The core of their proposed architecture is an array of reconfigurable vSwitches realized in the FPGA logic. The input and output queues, buffers, interfaces, and vSwitch queues are implemented as ASIC. Additionally, the SoC provides a management interface to an external off-chip controller. In their evaluation, they achieved up to \(3.2\) Tbps for the ASIC implementation and \(129.61-132.63\) Gbps for a single vSwitch instance. However, they were constrained by the limited on-chip BRAM of the XCVU13P FPGA. This limitation allowed them to saturate the maximum bandwidth of \(3.2\) Tbps only for the vSwitch implementation of the L2-Switch. For the other evaluated use cases (Router, Firewall, INT), they achieved only approximately \(1.43-2.23\) Tbps (about 45–70% of the maximum bandwidth).
Discussion. In most real-world scenarios, serving multiple tenants is necessary. Given that resources on SmartNICs and switches are much more limited than on servers, it is crucial to investigate how to implement accelerators for specific tasks or domains, and to ensure a high degree of fine granularity and reusability in their design.
However, incorporating dedicated accelerators in a design increases the scheduling complexity when mapping tasks efficiently in a dynamic, multi-tenant environment. Therefore, it is necessary to investigate intelligent scheduling and placement methods, as demonstrated in [81] and [82]. The approach in [82] offers some advantages over [81], as it can map multiple tasks into a CU, potentially allowing better utilization of the resources of a Reconfigurable Partition (RP) and requiring less interaction with the scheduler. However, the authors of [82] currently only support manual mapping of physical chains to a CU, leaving automatic mapping as a topic for future research.
Works like PsPIN [53] and Flare [52] (based on [53]) have investigated the implementation of RISC-V-based MaSoCs for SmartNICs and switches, respectively. These designs provide advantages such as flexibility, high performance through a high degree of parallelism, and easy scheduling due to the homogeneous design of the PEs (in contrast to [81, 82]). However, the scalability of such designs may be limited as the number of clusters and PEs per cluster increases, due to interconnection complexity and limited access to shared memory. To address these limitations, such designs could be complemented by fine-granular hardware accelerators that can be shared within and/or between clusters. This approach, however, would potentially increase scheduling complexity again.
While both [102] and [82] discuss Partial Reconfiguration (PR), [102] also includes a simple model regarding the effect of PR during the initial (re-)configuration of the CUs. PR provides the flexibility to change specific regions/CUs, but this typically comes at the cost of increased resource utilization, limiting the design space for (global) optimizations and potentially leaving partitions underutilized. Furthermore, it can lead to higher latencies when elements interact across partitions [102]. To address this, techniques like chaining offloaded tasks into a single RP, as investigated in [82], could provide an approach for better utilization of the RP. However, only a limited number of works, such as [82, 90, 102], have considered or investigated PR for SDN and INC. Furthermore, technologies like Nested Dynamic Function eXchange [32] still remain to be investigated in this context.

9.2 Investigation of SmartNIC Designs for eBPF/XDP Offloading to FPGAs

Brunella et al. [46] presented hXDP, a solution for accelerating eBPF/XDP programs on a NetFPGA. They implemented an IP consisting of a Programmable Input Queue and an Output Queue connected to a finite state machine called the Active Packet Selector (APS), which contains separate packet read and write buffers that allow parallel reads and writes. The APS is connected via a data bus to an implementation of the eBPF Instruction Set Architecture (ISA) [4] on a custom 4-stage, 4x superscalar soft-core Very Long Instruction Word (VLIW) processor on the FPGA. To take advantage of the data parallelism inherent in some functions, the VLIW processor is connected via an additional bus to the helper function module used to implement the accelerators. The interface of the helper function module follows the eBPF definition. The accelerators can read data directly from the registers containing the arguments of the function call (R1–R5) and write the results back to the return value register (R0). To support the functionality of eBPF maps4 in hardware, a configurator for the shared maps memory region on the FPGA was implemented, which creates the maps at program load time. To support direct access to individual map entries, the maps memory, like the APS, is connected directly to the VLIW processor via the data bus.
In order to run unmodified eBPF code, they provide a compiler that translates the compiled eBPF bytecode to the design's hXDP ISA. The custom compiler for their hardware target optimizes the eBPF program by removing instructions unnecessary on the FPGA target, such as boundary checks and the zeroing of memory areas for variable initialization. They also changed the ISA from a two-operand machine to a three-operand machine and added support for 6-byte alignment5 for load/store instructions to fit the field size of the MAC address in a packet. Since the implementation of the VLIW processor on the FPGA is much simpler from a design perspective than that of a modern x86 CPU, which supports, for example, out-of-order execution with branch prediction and speculative execution, they instead use static compiler analysis for instruction-level parallelism and schedule branches to different lanes for parallel execution.
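To make the offload target concrete, the following is a minimal XDP program in standard libbpf C of the kind hXDP executes (our example, not code from [46]):

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Drop all IPv4 packets, pass everything else. The bounds check
 * against data_end is mandated by the kernel verifier -- one of the
 * instruction classes hXDP's compiler elides for the FPGA target,
 * where memory accesses are safe by construction. */
SEC("xdp")
int xdp_drop_ipv4(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)  /* verifier-mandated check */
        return XDP_PASS;
    if (eth->h_proto == bpf_htons(ETH_P_IP))
        return XDP_DROP;
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";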
Their work showed that it is reasonable to offload XDP programs to a NIC. Not only can this lead to better performance for networking applications while being more power efficient, but it can also free up CPU cores from specialized networking tasks that they were not designed for, so they can be used for other, more general tasks instead. The latest version of hXDP has been integrated into Corundum [58] and now targets a Xilinx Alveo U50 [42].
A follow-up work by Bonola et al. [42] extended the Corundum-integrated design of hXDP [46] with a fused parser-MAT pipeline, called the Warp Engine (WE). The idea is that simple parts of an eBPF/XDP program, such as header parsing, can be offloaded to simple MATs. In addition to the WE, they extend the hXDP compiler flow by adding their own custom compiler, called the Warp Optimizer, in front of this flow. The Warp Optimizer first reads the bytecode of the program and creates a Control Flow Graph (CFG) to analyze which parts can be offloaded. For offloading to the WE, the compiler, similar to a P4 compiler, creates match-action rules to configure the MAT. The WE also provides a feature called context restoration: if the WE cannot run the whole program, it executes only parts of it and creates a correct initial state for hXDP, which then processes the rest. The WE is also able to process the next packet while hXDP is still busy with the current packet. This ensures that the WE does not become the bottleneck in the system. Tasks such as simple packet forwarding can be handled without the hXDP components being involved at all.
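Conceptually, the WE dispatch with context restoration can be sketched as follows (hypothetical C types; the actual WE is a hardware pipeline, and run_hxdp stands in for handing control to the hXDP core):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MAT result: either a terminal verdict (fast path) or
 * a restored initial state for the hXDP core (context restoration). */
typedef struct {
    bool     terminal;   /* rule handles the packet completely */
    int      verdict;    /* e.g., drop/forward if terminal     */
    uint32_t resume_pc;  /* hXDP entry point otherwise         */
    uint64_t regs[11];   /* pre-computed eBPF register file    */
} we_result_t;

/* Stand-in for handing packet and restored context to the hXDP core. */
static int run_hxdp(uint32_t pc, const uint64_t regs[11],
                    void *pkt, size_t len) {
    (void)pc; (void)regs; (void)pkt; (void)len;
    return 0;
}

static int we_dispatch(const we_result_t *r, void *pkt, size_t len) {
    if (r->terminal)
        return r->verdict;  /* fast path: hXDP never involved */
    return run_hxdp(r->resume_pc, r->regs, pkt, len);  /* slow path */
}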
In general, the concept behind this idea is the same as fast-path processing and slow-path offloading, as known from PISA architectures. The difference is that while slow-path offloading usually relies on GPPs, the hXDP architecture can also provide accelerators for complex rules.
An alternative approach to [46] and [42] is eHDL [97]. Both [46] and [42] provide a predefined architecture that is mapped/prototyped on an FPGA, and the eBPF bytecode is compiled for this fixed target architecture. eHDL [97], on the other hand, also takes the eBPF bytecode as input, but instead generates a hardware pipeline from it, which is integrated into the Corundum NIC. Like the hXDP compiler, eHDL merges the two-operand eBPF instructions into three-operand instructions, and as in [42, 46], a CFG and a Data Dependency Graph of the program are generated. eHDL provides templates for the hardware primitives and maps the instructions onto them. To support (conditional) jump instructions, the pipeline creates disable signals that deactivate the operations of the following stages, simply forwarding the packet through them until the jump offset (in pipeline stages) from the current stage is reached; hXDP, in contrast, schedules all dependent instructions on the same lane to avoid data hazards and uses per-lane data forwarding to avoid stalling the pipeline for these dependent instructions. For possible Write-After-Read data hazards, the design generates delay registers that work on a FIFO principle. In the case of Read-After-Write (RAW) data hazards, they distinguish between two cases: per-flow and global states. For the per-flow case, they insert an additional block that stores the addresses of read operations and compares them with the address of a write operation. If at least one of the addresses matches, the pipeline is flushed. To avoid repeated write operations to earlier maps leading to a wrong system state when the RAW hazard is detected at a later stage, so-called elastic buffers are added between the pipeline stages. This allows only the pipeline stages starting from the hazard detection to be flushed, so that earlier stages do not have to be repeated. While this can theoretically improve throughput compared to a full pipeline flush, the authors' evaluation showed that the actual probability of flushes caused by per-flow states, and the resulting degradation of throughput, is low. For frequently used global states such as packet counters, eBPF provides an atomic operation to avoid concurrent access to the same map. To realize the same behavior in hardware, eHDL provides a block that performs in-place Key-Value (KV)-based lookups.
Discussion. The advantage of the approach presented in [97] is that it utilizes only the hardware components required for the given instructions. This avoids the overhead of instruction fetch and decode phases present in [42, 46], as well as the need for hardware to execute/accelerate instructions that might not be used in some cases. Both the eHDL and SDNet implementations achieved line-rate performance, unlike hXDP and BlueField-2. For the tested use cases, eHDL delivered up to about \(164.4\) times higher throughput compared to hXDP and up to about \(23.4\) times higher throughput compared to BlueField-2. This demonstrates that dedicated hardware pipelines for packet processing can achieve significantly higher and more consistent performance than GP/MuSoC-based architectures, while being more power efficient.6 Additionally, eHDL requires less than \(20\%\) of the resources available on the FPGA (Figure 5), leaving room for scalability or for providing multiple services simultaneously. However, a disadvantage exists: Adapting the architecture to a new program requires synthesis and FPGA reconfiguration. The authors of [97] acknowledge this and consider using DPR for future work. This would allow adding, replacing, or removing hardware pipelines at runtime, depending on the specific services needed and/or the required throughput. While DPR might solve the downtime issue during reconfiguration, the synthesis process itself remains a significant overhead and is not suitable for fast redeployment. A potential solution to this problem, as discussed in Section 8.5, could be the use of overlays.
Fig. 5.
Fig. 5. Comparison of the reported resource usage of hXDP [46], hXDP + WE [42], and eHDL [97] with SDNet. Since all three proposals [42, 46, 97] use Corundum [58] on an Alveo U50 as the basis for the NIC design, it is listed here as well to show the overhead introduced by the proposed designs. Rivitti et al. [97] did not report on L2 ACL and Katran, so they are omitted here. Results for Corundum are based on synthesis results obtained using Vivado 2023.1.

9.3 Proposals for Different Application Domains Realized on FPGAs

In the following, we selected works conducting research in the domains of Domain Name Server (DNS), CPs, caching, ML, and security. For each domain, we first introduce the domain, present challenges, and discuss proposed works that address (some of) these challenges.
DNS. The DNS service is one of the most essential services on the Internet. It defines the mapping of human-readable domain names to machine-readable IP addresses. However, as demand grows for network services such as DNS to handle an increasing number of queries while providing low latency, software-only solutions will no longer be sufficient.
Kiwi [104] is an HLS system that generates RTL (Verilog) output from C# code, which can be used to program FPGAs. Emu [111] is a standard library that supports the implementation of network functions with Kiwi. The Emu library includes a simple DNS server (Emu DNS in Table 3) that can handle non-recursive queries and query resolution with names of at most 26 bytes for IPv4 addresses. However, according to the authors, these restrictions can be relaxed to handle longer names and IPv6 addresses.
Another approach to implementing a DNS service on FPGAs is P4DNS [124]. Unlike Emu, P4DNS is implemented in P4 using NetFPGA. P4DNS provides a simple low-level name server that supports recursive queries. The implementation uses features from both the data and control planes to achieve high performance. The data plane implementation is based on the RMT pipeline, with some modifications, and is responsible for the performance-critical parts of the design.
Discussion. When investigating basic network services such as DNS with modern network processing models/languages such as PISA/P4, problems became apparent. First, the generality of the parser FSM produced by P4C leads to increased compile time and resource usage, because even hard-coded headers (Ethernet, IPv4, UDP, and DNS) that could be compressed into a single state are produced as separate states. In addition, P4C does not support variable-length fields; since DNS headers can have different lengths, the parser implementation supports multiple lengths, each with its own state [124]. Although P4 provides a vendor-independent model, the restrictions and limitations create strong hurdles that may make implementations for INC inadequate or even impossible. Therefore, further research is needed in the direction of providing/improving high-level abstraction models and (automatic) tool flows capable of efficiently translating and mapping such a model to reconfigurable hardware. To realize such tool flows, the investigation of improved HLS compilers for C# [104], C, or C++ (e.g., from AMD/Xilinx or Intel), automatic integration on FPGA-equipped SmartNICs and switches, and libraries such as Emu [111] will be important to ease the hurdles for implementing reconfigurable hardware accelerators targeting INC.
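The variable-length problem is easy to see in code: parsing a DNS name requires a loop whose trip count depends on the packet contents, which a P4 parser must unroll into one state per supported length. A minimal C sketch (ours, not from [111] or [124]):

#include <stddef.h>
#include <stdint.h>

/* A QNAME is a sequence of length-prefixed labels terminated by a
 * zero byte (or a 2-byte compression pointer), so its total size is
 * only known after walking the labels. Returns the offset just past
 * the name, or -1 on truncated input. */
ptrdiff_t skip_qname(const uint8_t *msg, size_t len, size_t off) {
    while (off < len) {
        uint8_t l = msg[off];
        if (l == 0)                  /* root label: end of name  */
            return (ptrdiff_t)(off + 1);
        if ((l & 0xC0) == 0xC0)      /* compression pointer      */
            return (off + 2 <= len) ? (ptrdiff_t)(off + 2) : -1;
        off += 1 + l;                /* skip length byte + label */
    }
    return -1;
}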
CP. A fundamental problem in distributed computing is how to ensure the reliability of distributed systems while processes may be faulty. The goal of CPs is therefore to find consensus among processes about states or data needed for computation. The process failures in a distributed system can be divided into two classes: crash failures and Byzantine failures. Crash failures describe scenarios in which a process abnormally terminates its execution and cannot be resumed. Byzantine failures, on the other hand, describe arbitrary process failures. These include cases such as crashes, incorrect states, failure to send or receive messages, sending messages with incorrect or even malicious content, and so on. Byzantine failures can disrupt other processes or cause failures, whether unintentional or caused by a malicious attacker. However, they cannot control the network: any process can uniquely identify the sender of a message [61]. Crash failures are handled by Crash Fault Tolerant CPs, and Byzantine failures by Byzantine Fault Tolerant (BFT) CPs. However, to meet the demands for scalability, high bandwidth, and low latency, BFT CPs implemented in software would become a bottleneck.
Sit et al. [105] implemented the Practical Byzantine Fault Tolerance (PBFT) algorithm [47] to study different configurations of it. In their work, they show that it is unlikely that networks with more than 10 Gbps can be saturated without the use of hardware accelerators. Therefore, they investigated offloading the processing to SmartNICs and standalone FPGAs.
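For reference, the quorum predicates at the heart of PBFT [47] can be sketched as follows (a minimal illustration of the protocol logic, not of the accelerated implementation in [105]):

#include <stdbool.h>
#include <stdint.h>

#define F 1                /* tolerated Byzantine faults     */
#define N (3 * F + 1)      /* replicas required by PBFT [47] */

typedef struct {
    bool     pre_prepare;  /* matching PRE-PREPARE received            */
    uint32_t prepares;     /* matching PREPAREs from distinct replicas */
    uint32_t commits;      /* matching COMMITs (may include own)       */
} msg_state_t;

/* A request is prepared after the PRE-PREPARE plus 2f matching
 * PREPAREs, and committed locally after 2f + 1 matching COMMITs. */
bool prepared(const msg_state_t *s) {
    return s->pre_prepare && s->prepares >= 2 * F;
}

bool committed_local(const msg_state_t *s) {
    return prepared(s) && s->commits >= 2 * F + 1;
}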
Discussion. One major challenge they identified for offloading PBFT algorithms is that the iterative computation of cryptographic algorithms such as RSA is difficult to realize on FPGAs because of the low achievable frequency, and it would account for a large portion of the logic utilization of a PBFT implementation. Nevertheless, their work shows that with SmartNICs, fine-grained batching with strict latency guarantees is a reasonable option for acceleration, and that mid-range FPGAs such as the Xilinx Virtex-7 690T are already a good target platform for PBFT implementations. In general, FPGAs perform best for highly parallel algorithms, but are not the best choice for iterative algorithms, where higher frequencies are beneficial. However, we are seeing a trend where major FPGA manufacturers offer a PS tightly coupled to the FPGA's PL on a single SoC, which can also be used to implement the iterative parts of an algorithm, providing the benefits of both worlds.
Intrusion Detection and Prevention Systems (IDS/IPS). IDS/IPS are critical components of the network infrastructure to protect systems from attacks. Both systems scan network packets and compare their contents to a database of known threats. However, an IDS is only responsible for detection and monitoring and does not intervene on its own, while an IPS accepts or rejects packets based on a set of rules. Network operators are faced with the challenge of securing networks with hundreds of thousands of concurrent connections and tens of thousands of rules to be checked by the IDS/IPS. To ensure future scalability, software-only solutions are insufficient, and PISA designs such as the Tofino switch cannot be used to implement a full IDS/IPS.
Therefore, Zhao et al. proposed an FPGA-based solution for SmartNICs, called Pigasus [132]. In general, the concept of IDS/IPS is based on solving pattern matching problems (header matches, string matches, and regular expressions [132]). The proposed design consists of an FPGA connected via Ethernet and CPUs connected via PCIe. The parser, reassembler, and Multi-String Pattern Matcher (MSPM) (and its extensions, Non-Fast/Fast Pattern String Matching) are implemented on the FPGA. The Regular Expression and Full Match stages are implemented on the CPUs; however, they only interact with about 5% of the packets. A core component of the design is the reassembler, which takes the forwarded packets from the parser, sorts the TCP packets, and records the last bytes of the previous packets to allow contiguous and cross-packet searches by the MSPM. To fit the Snort registered ruleset [25, 98] into the available BRAM of the Intel Stratix 10 MX FPGA, the proposed design uses Hyperscan-inspired hash table lookups [123] for the MSPM instead of state machines for exact matching. For the full matcher on the CPU side, an adapted version of Snort 3 is used to receive Protocol Data Units and rule IDs. With their FPGA-based SmartNIC design combined with a CPU system, they were able to develop a single-server solution for IDS/IPS.
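The shift from exact-matching state machines to hash-based lookups can be illustrated with a simplified prefilter sketch in C (window size, table layout, and hash function are ours, not Pigasus's):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WIN   8            /* bytes hashed per window */
#define SLOTS 4096         /* power of two            */

typedef struct { uint8_t prefix[WIN]; int rule_id; } slot_t;
static slot_t table[SLOTS];        /* filled offline from the ruleset */

static uint32_t h(const uint8_t *p) {   /* toy FNV-style window hash */
    uint32_t x = 2166136261u;
    for (int i = 0; i < WIN; i++) x = (x ^ p[i]) * 16777619u;
    return x & (SLOTS - 1);
}

/* Probe the table at every byte offset; only hits (rule_id != 0)
 * proceed to the expensive full/regex match stages. */
int prefilter(const uint8_t *payload, size_t n, int *rule_out) {
    for (size_t i = 0; i + WIN <= n; i++) {
        slot_t *s = &table[h(payload + i)];
        if (s->rule_id && memcmp(s->prefix, payload + i, WIN) == 0) {
            *rule_out = s->rule_id;    /* candidate match         */
            return 1;
        }
    }
    return 0;                          /* fast path: no suspicion */
}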
Discussion. The authors showed that it is generally possible to implement a complete IDS/IPS on a single FPGA. Only the full match stage was still running on a CPU, because it is not the bottleneck for most packets and an implementation would have required too much BRAM (\(\sim\)24 MB) for only about 5 Gbps of traffic. Also related to the limited amount of on-chip memory, the authors faced a design tradeoff between the number of rules, concurrent flows, and the number of pipeline replications. While they could achieve 200 Gbps throughput with two pipelines, they were quickly limited by the number of rules they could store in the BRAM. However, since newer FPGAs provide more on-chip memory (up to several hundred Mb), for example, in the form of Ultra-RAM (URAM), or up to several tens of GB of HBM in a single package, implementing a complete IDS/IPS on a single FPGA may be possible with modern FPGAs.
Caching/KVS. Local storage/caching of data as close to a PE as possible is a proven technique for achieving low latency, not only for GPPs, but the concept is also used in large systems such as data centers. The reason for this is that it is not practical for data centers or other distributed systems to store all data locally at each node, which increases storage costs or requires additional synchronization to update all data over the network. Cloud server systems also use caching techniques to store frequently requested data on cloudlet servers, while requesting data from cloud servers only when it is missing from the cache to exploit data locality [37]. To meet the increasing demands for low latency, even with a large number of end-user devices, data must be stored even closer to or directly at the edge.
LaKe. LaKe [117] is a hardware solution implemented in Verilog based on the Memcached design [9, 57]. LaKe is implemented on a NetFPGA-SUME [136]. To hide access latency, the design provides two cache layers. The first layer is a shared cache consisting of fast but small on-chip BRAM, and the second layer consists of slower but larger on-board DRAM. The DRAM is used to store hash table buckets and data chunks, while the BRAM is used as a shared cache between PEs to store frequent KV pairs. In addition, a slab allocator, as also used in Memcached, is implemented in hardware, using the available on-board SRAM to store the addresses of unused chunks and a FIFO to preload the next available address to hide access latency.
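Conceptually, the lookup path resembles a two-level cache; the following C sketch illustrates the idea (structures and sizes are illustrative, and dram_lookup stands in for the on-board DRAM hash table):

#include <stdbool.h>
#include <stdint.h>

#define L1_SLOTS 256

typedef struct { uint64_t key; uint64_t val; bool used; } entry_t;
static entry_t l1[L1_SLOTS];       /* stands in for the on-chip BRAM cache */

/* Stand-in for the hash-table lookup in on-board DRAM. */
static bool dram_lookup(uint64_t key, uint64_t *val) {
    (void)key; (void)val;
    return false;
}

bool kv_get(uint64_t key, uint64_t *val) {
    entry_t *e = &l1[key % L1_SLOTS];    /* direct-mapped probe  */
    if (e->used && e->key == key) {      /* L1 hit: BRAM latency */
        *val = e->val;
        return true;
    }
    if (!dram_lookup(key, val))          /* miss in DRAM as well */
        return false;
    e->key = key; e->val = *val; e->used = true;  /* promote to L1 */
    return true;
}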
Discussion. Their research showed a significant improvement in throughput, power efficiency, and especially latency compared to a host-based KV system. While the authors evaluated their design on a SmartNIC, they noted that such a design could also be implemented on switches. With the growing number of participants, such as IoT devices, the need to keep data close to the end devices within the network becomes apparent. This not only minimizes latency and offloads resources from the host, but also relieves transmission pressure within the network when deployed on switches, further reducing latency. In addition, conformance and seamless integration with existing technologies such as Memcached is of great importance to keep the hurdles for developers as low as possible, making it feasible to integrate INC designs into new and existing infrastructure.
ML. To efficiently train ML models, a large amount of data is required. Scaling up, i.e., upgrading a device/server with more powerful hardware, is limited and economically unfeasible beyond a certain point. Therefore, a common approach to scaling to large models and data volumes is to exploit data parallelism, not only on a single device, but by partitioning and distributing the data across multiple worker nodes. Training on data in parallel over the network and synchronizing updates is supported by popular ML frameworks such as PyTorch [22] (Distributed Data-Parallel training) and TensorFlow [28] (tf.distribute.Strategy). Supervised ML models use iterative algorithms such as Stochastic Gradient Descent, where the sum of model updates must be computed frequently. This leads to periods of high network load for transmitting the updates, which results in the need to pause training until the updates are complete.
Liu et al. [85] proposed an in-network aggregation approach based on Remote Direct Memory Access (RDMA) [86], called NetReduce. They connected a Virtex UltraScale FPGA [30] to a standard (nonprogrammable) switch. In their proposed design, updates sent by worker nodes are aggregated in the network, and only the resulting aggregated gradient values are forwarded. They compare their proposed design to SwitchML [100], which is similar to NetReduce but has been implemented for programmable ASIC switches (Tofino). To interact with the proposed switch design from the host side, they reuse the RDMA over Converged Ethernet protocol for their NetReduce protocol and the NVIDIA Collective Communication Library (NCCL) for optimized communication with the NVIDIA GPUs in their testbeds. They exposed the NetReduce protocol as a primitive in NCCL and used it in PyTorch and in the Horovod framework to support TensorFlow.
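At its core, the aggregation that such designs move into the network is an element-wise sum over the workers' update vectors. A minimal C sketch with illustrative sizes (real designs such as SwitchML operate on quantized integer segments, since switch pipelines lack floating-point support):

#define WORKERS 4
#define DIM     8

/* Sum the update vectors of all workers element-wise so that only
 * one aggregate leaves the switch instead of one vector per worker. */
void aggregate(const float grad[WORKERS][DIM], float out[DIM]) {
    for (int d = 0; d < DIM; d++) {
        float sum = 0.0f;
        for (int w = 0; w < WORKERS; w++)
            sum += grad[w][d];     /* element-wise reduction        */
        out[d] = sum;              /* one result instead of WORKERS */
    }
}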
Taurus [113] presents a modified RMT pipeline for the data plane by introducing a MapReduce control block based on the Plasticine CGRA [95]. Their design targets line-rate inference for ML models such as Support Vector Machines, Deep Neural Networks, k-means, and Long Short-Term Memory. However, in the absence of a CGRA-based programmable switch, the authors built a prototype using a Tofino switch connected over a 100 Gbps Ethernet link to a Xilinx Alveo U250 to emulate the MapReduce block and evaluate their design. In their follow-up work, Homunculus [114], they present a framework that supports both Taurus switches and programmable ASIC switches. Homunculus provides functions in Python that can be used, for example, with TensorFlow. The Python front-end generates JSON configuration files that are optimized by HyperMapper [92]. The back-end takes the optimized configuration to generate P4 or Spatial [78] code, depending on the target switch. To generate the P4 code targeting Tofino switches, they use the IIsy framework [133].
Discussion. According to [85] and [113], the standard P4-programmable PISA architecture is not sufficient to accelerate ML in the data plane. In addition to the lack of features to implement ML models in PISA architectures [113], storing the results of the ML model [113] or the RDMA link [85] as flow rules would lead to frequent interaction with the control plane in modern dynamic network environments [85, 113], which should be avoided as it leads to additional delays in message processing. For example, in the context of in-network inference, the authors of Taurus [113] found that not the ML itself but rather table rule installation and packet collection become the bottleneck when telemetry packets are sent to the control plane. In their study, they showed that this high delay causes the design to miss most anomalies, resulting in a low F1 score. When inference is performed on the data plane using the CGRA instead, they achieve a high F1 score regardless of the sampling rate (Table 3). For online training, where the control plane optimizes the global metric and updates the weights of the Taurus ML model, they found that higher sampling rates and more epochs with small batch sizes lead to faster convergence of the F1 score (Table 3). Their work showed that using ML models to optimize network functions such as anomaly detection can be realized in the network, but also that current PISA-based switches lack the necessary features and CPU-based solutions do not provide the necessary performance to run ML models efficiently [113], which confirms the study conducted by Cooke and Fahmy [51].

9.4 Lessons Learned and Open Questions

We reviewed research proposing different hardware implementation solutions for different domains. We found that many challenges arise from the limitations of existing PISA-based architectures, but also that there is no one-size-fits-all solution. While some problems, such as variable field lengths, could be solved using the extern calls introduced with P4\({}_{16}\), P4 implementations would lose their generality/portability [124]. In addition, many PISA design solutions would require frequent interactions with the control plane to offload processing to the network [79, 85, 113] or lack necessary features [85]. Proposals such as PsPIN [53] can provide high performance despite a high degree of flexibility and parallelism, but there are limitations as soon as domain-specific acceleration is required. While [109] and [102] investigated virtualization for multi-tenant program offloading on reconfigurable switches, including security aspects [109] and a discussion of challenges regarding PR [102], their scope was limited to basic network applications like L2-Switches and INT. Works such as [81, 82] have included the investigation of hardware scheduling techniques for heterogeneous SmartNIC devices. However, several areas remain open for future work, including the automation of finding the right partition size, and task-mapping mechanisms that decide which task or task chain to offload to a specific RP and when to perform PR.
As we have seen in the study by Sit et al. [105], while it is possible to implement entire services on FPGAs, implementations of, for example, iterative algorithms may suffer from low achievable frequencies, and some parts, as in the work of Zhao et al. [132], are not easy to implement on hardware accelerators, so the performance gained by offloading them is not in reasonable proportion to the effort. Liu et al. [85] conclude that offloading a task such as in-network aggregation to a SmartNIC is generally only useful to free up the host CPU, since implementing a hardware accelerator requires a lot of development effort, and MuSoC-based SmartNICs such as NVIDIA's BlueField DPUs may suffer from similar latency and throughput issues as existing in-network aggregation solutions.
Designs such as hXDP [46] have proposed combining an eBPF-compliant VLIW processor with tightly integrated accelerators and the WE [42]. However, as the accelerators in hXDP are fixed at runtime, it lacks, like state-of-the-art MuSoCs and MaSoCs, the flexibility to exchange hardware accelerators online. eHDL [97], on the other hand, provides an automatic eBPF \(\rightarrow\) FPGA workflow that creates a whole pipeline instead. It improves throughput compared to hXDP and BlueField-2 [42, 46], with similar latency compared to hXDP (Table 3). However, eHDL does not provide online reconfigurability either; in fact, hXDP even has the advantage that it can be reprogrammed and the WE can be repopulated at runtime [46]. Therefore, realizing online reconfigurability to support offloading and exchanging eBPF programs (or programs in general) without reconfiguring the whole FPGA and interfering with other eBPF programs remains an open challenge for research.
The possible range of applications that can be covered with INC is wide. It includes domains like network services, security, and ML, which we have discussed in this work, but also extends to areas such as Computer Vision and robotics. To make the utilization of reconfigurable hardware accelerators attractive, not only the network community but also developers from other domains have to be convinced. Programming diverse network devices requires architecture-specific knowledge, even with tools like SDNet/VNP4 for FPGAs. While [42, 46, 97] investigated the offloading of eBPF/XDP to hardware accelerators, their work required implementing custom compilers for specific targets and hardware knowledge to add new features. DSLs and libraries that abstract target-specific functionalities and optimizations could help to lower this burden. Additionally, compiler infrastructures like MLIR [80] can provide reusability and extensibility for the rapid deployment of domain-specific compilers or feature extensions. Therefore, the investigation of design choices to support hardware acceleration for various applications without burdening application developers [135], including the seamless integration of reconfigurable hardware accelerators into existing network and application infrastructures and paradigms, is a potential topic that few have addressed. To determine the benefits and identify potential challenges of INC hardware acceleration, further research under different network traffic conditions and with consideration of custom protocols [108] is needed. Additional research challenges, such as programmability and adaptability, are discussed in the following section.

10 Additional Open Research Challenges

Although research in the field of SDN and INC has made considerable progress, we have seen in the literature that developers and researchers face significant challenges in programming the data plane.
Limited Compute Capabilities and Adaptability. Delivering high performance from each compute device in the network, while being cost- and energy-efficient, cannot be achieved by software alone (Section 4). Accelerators for cryptography, for example, are implemented as hardened IP in most modern architectures. However, for many applications where existing hardened accelerators cannot be used, these devices must rely on GP cores for processing. While proposals such as eHDL [97] provide flexibility and line-rate performance, they come with the disadvantage of high deployment (and development) time whenever the accelerator needs to be adapted. To address these issues, we believe that techniques such as JIT compilation for coarse- or multi-level hardware accelerators and the integration of embedded reconfigurable hardware, such as embedded FPGAs tightly coupled to GP cores, need to be further investigated for INC acceleration.
Computation vs. Latency and Throughput. Ports and Nelson [94] concluded that PISA-based programmable ASIC switches achieve similar latency to standard nonprogrammable switches due to their fast lookup capabilities, but they lack flexibility. FPGA-based switches, on the other hand, could provide the required flexibility. However, putting more computation into the devices could come at the expense of achievable throughput and also increase latency if not done carefully.
Limited Memory and Bandwidth. Another problem is the low memory capacity of in-network hardware. High-bandwidth memory on programmable network devices is scarce. Both FPGA-based and programmable ASIC switches are limited to a few tens of MB of on-chip SRAM. Modern AMD/Xilinx FPGAs provide a larger on-chip memory, URAM, in addition to BRAM, but the total on-chip memory is still in the range of a few hundred Mb. FPGA-based devices can be configured to access on-board DRAM. However, DRAM access comes at the cost of much lower bandwidth. In addition, a few gigabytes is not much in the age of petabytes of data on servers. Therefore, in addition to a well-optimized implementation of the functionality, caching techniques are required to make efficient use of the available on-chip and on-board memory.
Abstraction and Orchestration. Benson [41] identified management challenges associated with INC. One of the challenges identified is the need for an orchestrator capable of placing functionality into the network. In order for the orchestrator to make the right decisions about what functionality to place where and when, an abstraction of the network topology is required that takes into account the available resources and capabilities of the network devices, as well as the current states of each device, e.g., network load, faults, and so on. As stated in [114], the placement of algorithms, especially in complex networks, can no longer be done by network operators alone, but requires the support of an automated orchestrator.
Multi-Tenancy and Virtualization. To facilitate hardware abstraction for orchestration, virtualization techniques will be essential to provide a more unified view of all available nodes in the network [69]. To better utilize available resources and provide high performance to multiple clients, it is necessary to be able to run different tasks simultaneously on the same device. Such virtualization techniques must provide isolation-related features, both in terms of security and performance, while allowing resource sharing and not introducing large overheads [64, 122].
Encryption/Security. Network traffic encryption, while necessary for security, hinders INC, since computations cannot be performed on encrypted packets, and traditional key-sharing encryption is not practical for INC. Mafioletti et al. [87] demonstrated how P4-programmable INC devices can be exploited for attacks on Robot Operating System-based robotic systems. While solutions such as SROS2 [88] exist, they have limitations and are not directly applicable to INC solutions such as [79]. Homomorphic Encryption (HE) is a potential remedy, but current HE schemes are either computationally expensive or provide only limited security [33]. However, research in this area is ongoing and has made promising progress.
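To illustrate the property that makes HE attractive for INC, the following deliberately insecure C toy uses an additively homomorphic one-time pad modulo 2^64: an in-network node can add two ciphertexts without ever seeing the plaintexts, and only the endpoint holding the keys can decrypt the result. Real additively homomorphic schemes such as Paillier provide actual security, but at the far higher computational cost noted above:

#include <stdint.h>
#include <stdio.h>

/* Toy, completely INSECURE scheme: a one-time pad is additively
 * homomorphic modulo 2^64. Illustration of the property only. */
static uint64_t enc(uint64_t m, uint64_t k) { return m + k; }  /* wraps mod 2^64 */
static uint64_t dec(uint64_t c, uint64_t k) { return c - k; }

int main(void)
{
    uint64_t k1 = 0x1234abcd5678ef01ULL, k2 = 0x0fedcba987654321ULL;
    uint64_t c1 = enc(41, k1), c2 = enc(1, k2);

    uint64_t c_sum = c1 + c2;            /* in-network node: ciphertexts only */

    /* The endpoint knowing both keys recovers the plaintext sum: prints 42. */
    printf("%llu\n", (unsigned long long)dec(c_sum, k1 + k2));
    return 0;
}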
Time Constraints and Priority Handling. INC research often neglects priority handling and congestion control. In industrial scenarios, such as robot control sharing resources with other applications [79], prioritization is crucial. Adaptive decision-making is needed to process and route packets based on current and projected network load. This includes deciding when to perform INC at all, e.g., prioritizing high-priority packets during high load, forwarding low-priority packets unprocessed to avoid congestion, and choosing the nodes on which time-constrained packets are computed.
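A load-aware admission decision of the kind described could be sketched in C as follows (thresholds and metadata fields are our own illustration): deadline-bound flows are always computed in the network, while low-priority traffic is forwarded unprocessed once the load exceeds a threshold:

#include <stdint.h>

enum verdict { DO_INC, FORWARD_ONLY };

struct pkt_meta {
    uint8_t priority;       /* e.g., derived from DSCP bits */
    uint8_t time_critical;  /* deadline-bound traffic, e.g., robot control */
};

static enum verdict decide(const struct pkt_meta *m, uint32_t load_pct)
{
    if (m->time_critical)
        return DO_INC;               /* always compute deadline-bound flows */
    if (load_pct > 80 && m->priority < 4)
        return FORWARD_ONLY;         /* shed low-priority work under load */
    return DO_INC;
}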

11 Conclusion

In this work, we provide an overview of ongoing research in the area of SDN and INC, with a primary focus on how reconfigurable hardware can contribute to this area.
First, motivated by the advent of 6G technology and the IoE, we presented background information on SDN and INC. We introduced P4, the most popular data plane programming language today, and also discussed eBPF/XDP and DPDK, popular frameworks for packet-processing implementations. We categorized the architectures of SmartNICs and switches and argued that CGRAs could be a good alternative architecture in certain domains due to their faster reconfigurability and higher achievable frequency compared to FPGAs. Finally, we presented state-of-the-art research on architectures and application domains for accelerating applications in the network, discussed the open challenges, and outlined how reconfigurable hardware accelerators could contribute. With this survey, we hope to provide insight into this complex topic and to motivate the community to further research the use of reconfigurable hardware accelerators in the context of SDN and INC.

Acknowledgment

The authors acknowledge the financial support of the Federal Ministry of Education and Research of Germany within the programme “Souverän. Digital. Vernetzt.”

Footnotes

1
IPsec and the closely related TLS are protocols (or groups of protocols) for creating encrypted connections between devices over potentially unsafe networks such as the Internet. IPsec provides security at the network layer, while TLS operates at the transport layer and thereby secures application layer protocols such as HTTP.
2
Egress pipeline support can, however, be enabled by setting an additional compiler flag.
3
There are also several minor differences, such as the two-sided crossbar and the different scheduler design.
4
A generic memory for data sharing between the kernel and user space that can contain data of different types.
5
The standard ISA supports 1-, 2-, 4-, and 8-byte alignment only.
6
With the Alveo U50, the test setup consumes 20–25 W less power than with the BlueField-2 [97].

References

[1]
2014. Network Functions Virtualisation (NFV) - Network Operator Perspectives on Industry Progress. Technical Report. ETSI, SDN & OpenFlow World Congress, Düsseldorf, Germany.
[2]
AMD OpenNIC Project. 2023. Retrieved from https://github.com/Xilinx/open-nic
[3]
AMD Pensando™ Infrastructure Accelerators. 2023. Retrieved from https://www.amd.com/en/accelerators/pensando
[4]
BPF and XDP Reference Guide — Cilium 1.15.0-Dev Documentation. 2023. Retrieved from https://docs.cilium.io/en/latest/bpf/
[5]
DPDK. 2023. Retrieved from https://core.dpdk.org/doc/
[6]
eBPF - Introduction, Tutorials & Community Resources. 2023. Retrieved from https://ebpf.io
[7]
GitHub - NetFPGA/NetFPGA-PLUS. 2023. Retrieved from https://github.com/NetFPGA/NetFPGA-PLUS
[8]
Infrastructure Programmer Development Kit. 2023. Retrieved from https://ipdk.io/
[9]
Memcached - A Distributed Memory Object Caching System. 2023. Retrieved from https://memcached.org/
[10]
[12]
[13]
OpenCores: AES. 2023. Retrieved from https://opencores.org/projects/tiny_aes
[14]
OpenCores: SHA3 (KECCAK). 2023. Retrieved from https://opencores.org/projects/sha3
[15]
P4 DPDK Target Components. 2023. Retrieved from https://github.com/p4lang/p4-dpdk-target
[16]
The P4 Language Specification, Version 1.0.5. 2023. Retrieved from https://p4.org/p4-spec/p4-14/v1.0.5/tex/p4.pdf
[17]
P4 Portable NIC Architecture (PNA). 2023. Retrieved from https://p4.org/p4-spec/docs/PNA.html
[18]
P4₁₆ Language Specification. 2023. Retrieved from https://p4.org/wp-content/uploads/2022/07/P4-16-spec.html
[19]
P4₁₆ Portable Switch Architecture (PSA). 2023. Retrieved from https://p4.org/p4-spec/docs/PSA.html
[20]
p4c/backends/ebpf at main · p4lang/p4c. 2023. Retrieved from https://github.com/p4lang/p4c/tree/main/backends/ebpf
[21]
Programmer’s Guide — Data Plane Development Kit. 2023. Retrieved from https://doc.dpdk.org/guides/prog_guide/
[22]
PyTorch. 2023. Retrieved from https://www.pytorch.org
[24]
SDNet PX Programming Language User Guide (UG1016). 2023. Retrieved from https://docs.xilinx.com/v/u/2017.3-English/ug1016-px-programming
[25]
Snort - Network Intrusion Detection & Prevention System. 2023. Retrieved from https://www.snort.org/
[26]
SpinalHDL/VexRiscv. 2023. Retrieved from https://github.com/SpinalHDL/VexRiscv
[27]
T4P4S, a Multitarget P4₁₆ Compiler. 2023. Retrieved from https://github.com/P4ELTE/t4p4s
[28]
TensorFlow. 2023. Retrieved from https://www.tensorflow.org/
[29]
Terminology - SmartNICs Summit. 2023. Retrieved from https://smartnicssummit.com/terminology/
[33]
Abbas Acar, Hidayet Aksu, A. Selcuk Uluagac, Mauro Conti. 2018. A Survey on Homomorphic Encryption Schemes: Theory and Implementation. ACM Computing Surveys 51, 4 (2018), 1–35.
[34]
Anurag Agrawal and Changhoon Kim. 2020. Intel Tofino2 – A 12.9 Tbps P4-Programmable Ethernet Switch. In 2020 IEEE Hot Chips 32 Symposium (HCS). IEEE, Palo Alto, CA, 1–32.
[35]
Iqbal Alam, Kashif Sharif, Fan Li, Zohaib Latif, M. M. Karim, Sujit Biswas, Boubakr Nour, Yu Wang. 2021. A Survey of Network Virtualization Techniques for Internet of Things Using SDN and NFV. Computing Surveys 53, 2 (Mar. 2021), 1–40.
[36]
Gianni Antichi, Muhammad Shahbaz, Yilong Geng, Noa Zilberman, et al. 2014. OSNT: Open Source Network Tester. IEEE Network 28, 5 (Sept. 2014), 6–12.
[37]
Mohammad Babar, Muhammad Sohail Khan, Farman Ali, Muhammad Imran, Muhammad Shoaib. 2021. Cloudlet Computing: Recent Advances, Taxonomy, and Challenges. IEEE Access 9 (2021), 29609–29622.
[38]
Fetia Bannour, S. Souihi, and A. Mellouk. 2018. Distributed SDN Control: Survey, Taxonomy, and Challenges. IEEE Communications Surveys & Tutorials 20, 1 (2018), 333–354.
[39]
Luca Barsellotti, F. Alhamed, J. J. V. Olmos, F. Paolucci, P. Castoldi, and F. Cugini. 2022. Introducing Data Processing Units (DPU) at the Edge [Invited]. In 2022 International Conference on Computer Communications and Networks (ICCCN), 1–6.
[40]
Daniel U. Becker. 2012. Efficient Microarchitecture for Network-on-Chip Routers. Stanford University.
[41]
Theophilus A. Benson. 2019. In-Network Compute: Considered Armed and Dangerous. In the Workshop on Hot Topics in Operating Systems. ACM, Bertinoro Italy, 216–224.
[42]
Marco Bonola, G. Belocchi, A. Tulumello, M. S. Brunella, G. Siracusano, G. Bianchi, and R. Bifulco. 2022. Faster Software Packet Processing on FPGA NICs with eBPF Program Warping. In 2022 USENIX Annual Technical Conference (USENIX ATC ’22), 987–1004.
[43]
Pat Bosshart, Glen Gibb, Hun-Seok Kim, George Varghese, Nick McKeown, Martin Izzard, Fernando Mujica, Mark Horowitz. 2013. Forwarding Metamorphosis: Fast Programmable Match-Action Processing in Hardware for SDN. ACM SIGCOMM Computer Communication Review 43, 4 (2013), 99–110.
[44]
Pat Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, and D. Talayco. 2014. P4: Programming Protocol-Independent Packet Processors. ACM SIGCOMM Computer Communication Review 44, 3 (Jul. 2014), 87–95.
[45]
ONF. 2014. OpenFlow-Enabled SDN and Network Functions Virtualization. Open Networking Foundation 17 (2014), 1–12.
[46]
Marco Spaziani Brunella, G. Belocchi, and M. Bonola. 2022. hXDP: Efficient Software Packet Processing on FPGA NICs. Communications of the ACM 65, 8 (Aug. 2022), 92–100.
[47]
Miguel Castro and Barbara Liskov. 1999. Practical Byzantine Fault Tolerance. OsDI 99, 1999 (1999), 173–186.
[48]
D. C. Chen and J. M. Rabaey. 1992. A Reconfigurable Multiprocessor IC for Rapid Prototyping of Algorithmic-Specific High-Speed DSP Data Paths. IEEE Journal of Solid-State Circuits 27, 12 (Dec. 1992), 1895–1904.
[49]
S. Alexander Chin, K. P. Niu, M. Walker, S. Yin, A. Mertens, J. Lee, and J. H. Anderson. 2018. Architecture Exploration of Standard-Cell and FPGA-Overlay CGRAs Using the Open-Source CGRA-ME Framework. In the 2018 International Symposium on Physical Design. ACM, Monterey CA, 48–55.
[50]
Douglas Comer and Adib Rastegarnia. 2019. Toward Disaggregating the SDN Control Plane. IEEE Communications Magazine 57, 10 (Oct. 2019), 70–75.
[51]
Ryan A. Cooke and Suhaib A. Fahmy. 2020. A Model for Distributed In-Network and Near-Edge Computing with Heterogeneous Hardware. Future Generation Computer Systems 105 (Apr. 2020), 395–409.
[52]
Daniele De Sensi, S. Di Girolamo, S. Ashkboos, S. Li, and T. Hoefler. 2021. Flare: Flexible in-Network Allreduce. In the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, St. Louis Missouri, 1–16.
[53]
Salvatore Di Girolamo, A. Kurth, and A. Calotoiu. 2021. A RISC-V in-Network Accelerator for Flexible High-Performance Low-Power Packet Processing. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 958–971.
[54]
Ankur Dumka. 2018. Innovations in Software-Defined Networking and Network Functions Virtualization. IGI Global.
[55]
Danijela Efnusheva, G. Dokoski, A. Tentov, and M. Kalendar. 2016. Memory-Centric Approach of Network Processing in a Modified RISC-Based Processing Core. In 2016 Future Technologies Conference (FTC). IEEE, San Francisco, CA, 1181–1188.
[56]
K. Egevang and P. Francis. 1994. RFC1631: The IP Network Address Translator (NAT). RFC Editor.
[57]
Brad Fitzpatrick. 2004. Distributed Caching with Memcached. Linux Journal 2004, 124 (Aug. 2004), 5.
[58]
Alex Forencich, A. C. Snoeren, G. Porter, and G. Papen. 2020. Corundum: An Open-Source 100-Gbps NIC. In 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, Fayetteville, AR, 38–46.
[59]
Michael Galles and Francis Matus. 2021. Pensando Distributed Services Architecture. IEEE Micro 41, 2 (Mar. 2021), 43–49.
[60]
Michael Brian Galles, J. Bradley Smith, and Hemant Vinchure. 2022. Programmable Computer IO Device Interface. Patent No. US11263158B2, Filed Feb. 19th, 2019, Issued Mar. 1st., 2022.
[61]
D. Imbs, M. Raynal, and J. Stainer. 2016. Are byzantine failures really different from crash failures? In International Symposium on Distributed Computing (DISC), Lecture Notes in Computer Science (LNTCS), Vol. 9888, 215–229.
[62]
Ali Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. 2011. Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. In 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’11), 323–336.
[63]
Zhiyuan Guo, Y. Shan, X. Luo, Y. Huang, and Y. Zhang. 2022. Clio: A Hardware-Software Co-Designed Disaggregated Memory System. In the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, Lausanne Switzerland, 417–433.
[64]
Sol Han, S. Jang, H. Choi, H. Lee, and S. Pack. 2020. Virtualization in Programmable Data Plane: A Survey and Open Challenges. IEEE Open Journal of the Communications Society 1 (2020), 527–534.
[65]
Zijun Hang, Y. Wang, and S. Huang. 2021. P4 Transformer: Towards Unified Programming for the Data Plane of Software Defined Network. In 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC), 544–551.
[66]
R. W. Hartenstein, A. G. Hirschbiel, M. Riedmuller, K. Schmidt, and M. Weber. 1991. A Novel ASIC Design Approach Based on a New Machine Paradigm. IEEE Journal of Solid-State Circuits 26, 7 (Jul. 1991), 975–989.
[67]
Frederik Hauser, M. Häberle, D. Merling, and S. Lindner. 2023. A Survey on Data Plane Programming with P4: Fundamentals, Advances, and Applied Research. Journal of Network and Computer Applications 212 (2023), 103561.
[68]
Toke Høiland-Jørgensen, Jesper Dangaard Brouer, Daniel Borkmann, et al. 2018. The eXpress Data Path: Fast Programmable Packet Processing in the Operating System Kernel. In the 14th International Conference on Emerging Networking Experiments and Technologies. ACM, Heraklion, Greece, 54–66.
[69]
Ning Hu, Z. Tian, X. Du, and M. Guizani. 2021. An Energy-Efficient In-Network Computing Paradigm for 6G. IEEE Transactions on Green Communications and Networking 5, 4 (Dec. 2021), 1722–1733.
[70]
Stephen Ibanez, G. Brebner, N. McKeown, and N. Zilberman. 2019. The P4→NetFPGA Workflow for Line-Rate Packet Processing. In the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, Seaside, CA, 1–9.
[71]
Bassey Isong, R. R. S. Molose, A. M. Abu-Mahfouz, and N. Dladlu. 2020. Comprehensive Review of SDN Controller Placement Strategies. IEEE Access 8 (2020), 170070–170092.
[72]
Abhishek Kumar Jain, D. L. Maskell, and S. A. Fahmy. 2022. Coarse Grained FPGA Overlay for Rapid Just-in-Time Accelerator Compilation. IEEE Transactions on Parallel and Distributed Systems 33, 6 (Jun. 2022), 1478–1490.
[73]
Wei Jiang, B. Han, M. A. Habibi, and H. D. Schotten. 2021. The Road Towards 6G: A Comprehensive Survey. IEEE Open Journal of the Communications Society 2 (2021), 334–366.
[74]
Anuj Kalia, M. Kaminsky, and D. G. Andersen. 2014. Using RDMA Efficiently for Key-Value Services. In the 2014 ACM Conference on SIGCOMM, 295–306.
[75]
Elie Kfoury, J. Crichigno, and E. Bou-Harb. 2021. An Exhaustive Survey on P4 Programmable Data Plane Switches: Taxonomy, Applications, Challenges, and Future Trends. IEEE Access 9 (Feb. 2021), 87094–87155.
[76]
Sajad Khorsandroo, A. G. Sánchez, A. S. Tosun, J. M. Arco, and R. Doriguzzi-Corin. 2021. Hybrid SDN Evolution: A Comprehensive Survey of the State-of-the-Art. Computer Networks 192 (Jun. 2021), 107981.
[77]
Somayeh Kianpisheh and Tarik Taleb. 2023. A Survey on In-Network Computing: Programmable Data Plane and Technology Specific Applications. IEEE Communications Surveys & Tutorials 25, 1 (2023), 701–761.
[78]
David Koeplinger, M. Feldman, R. Prabhakar, Y. Zhang, S. Hadjis, R. Fiszel, T. Zhao, and L. Nardi. 2018. Spatial: A Language and Compiler for Application Accelerators. In the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, Philadelphia, PA, 296–311.
[79]
Sándor Laki, C. Györgyi, J. Pető, P. Vörös, and G. Szabó. 2022. In-Network Velocity Control of Industrial Robot Arms. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’22), 995–1009.
[80]
Chris Lattner, M. Amini, U. Bondhugula, A. Cohen, A. Davis, J. Pienaar, R. Riddle, and T. Shpeisman. 2021. MLIR: Scaling Compiler Infrastructure for Domain Specific Computation. In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2–14.
[81]
Jiaxin Lin, K. Patel, B. E. Stephens, A. Sivaraman, and A. Akella. 2020. PANIC: A High-Performance Programmable NIC for Multi-Tenant Networks. In USENIX Symposium on Operating Systems Design and Implementation (OSDI ’20), 243–259.
[82]
Will Lin, Y. Shan, R. Kosta, A. Krishnamurthy, and Y. Zhang. 2024. SuperNIC: An FPGA-Based, Cloud-Oriented SmartNIC. In the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, Monterey, CA, 130–141.
[83]
Leibo Liu, J. Zhu, Z. Li, Y. Lu, Y. Deng, J. Han, S. Yin, and S. Wei. 2020. A Survey of Coarse-Grained Reconfigurable Architecture and Design: Taxonomy, Challenges, and Applications. Computing Surveys 52, 6 (Nov. 2020), 1–39.
[84]
Ming Liu, Simon Peter, Arvind Krishnamurthy, and Phitchaya Mangpo Phothilimthana. 2019. E3: Energy-Efficient Microservices on SmartNIC-Accelerated Servers. In 2019 USENIX Annual Technical Conference (USENIX ATC ’19), 363–378.
[85]
Shuo Liu, Q. Wang, J. Zhang, W. Wu, Q. Lin, Y. Liu, M. Xu, M. Canini, R. C. C. Cheung, and J. He. 2023. In-Network Aggregation with Transport Transparency for Distributed Training. In the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Vol. 3. ACM, 376–391.
[86]
Shaonan Ma, T. Ma, K. Chen, and Y. Wu. 2022. A Survey of Storage Systems in the RDMA Era. IEEE Transactions on Parallel and Distributed Systems 33, 12 (2022), 4395–4409.
[87]
Diego Rossi Mafioletti, R. C. de Mello, M. Ruffini, V. Frascolla, M. Martinello, and M. R. N. Ribeiro. 2021. Programmable Data Planes as the Next Frontier for Networked Robotics Security: A ROS Use Case. In 2021 17th International Conference on Network and Service Management (CNSM). 160–165.
[88]
Victor Mayoral-Vilches, R. White, and G. Caiazza. 2022. SROS2: Usable Cyber Security Tools for ROS 2. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Kyoto, Japan.
[89]
Oliver Michel, R. Bifulco, G. Retvari, and S. Schmid. 2022. The Programmable Data Plane: Abstractions, Architectures, Algorithms, and Applications. Computing Surveys 54, 4 (May 2022), 1–36.
[90]
Vaibhawa Mishra, Q. Chen, and G. Zervas. 2016. REoN: A Protocol for Reliable Software-Defined FPGA Partial Reconfiguration over Network. In 2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig). IEEE, 1–7.
[91]
Marie-José Montpetit. 2022. The Network as a Computer Board: Architecture Concepts for In-Network Computing in the 6G Era. In 2022 1st International Conference on 6G Networking (6GNet), 1–5.
[92]
Luigi Nardi, A. Souza, D. Koeplinger, and K. Olukotun. 2019. HyperMapper: A Practical Design Space Exploration Framework. In 2019 IEEE 27th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), 425–426.
[93]
Quoc-Viet Pham, F. Fang, V. N. Ha, M. J. Piran, M. Le, L. B. Le, W. J. Hwang, and Z. Ding. 2020. A Survey of Multi-Access Edge Computing in 5G and beyond: Fundamentals, Technology Integration, and State-of-the-Art. IEEE Access 8 (2020), 116974–117017.
[94]
Dan R. K. Ports and Jacob Nelson. 2019. When Should the Network Be the Computer? In the Workshop on Hot Topics in Operating Systems. ACM, Bertinoro, Italy, 209–215.
[95]
Raghu Prabhakar, Y. Zhang, D. Koeplinger, M. Feldman, T. Zhao, S. Hadjis, A. Pedram, and C. Kozyrakis. 2017. Plasticine: A Reconfigurable Architecture for Parallel Patterns. In the 44th Annual International Symposium on Computer Architecture. ACM, Toronto, ON, Canada, 389–402.
[96]
Ju Ren, D. Zhang, S. He, Y. Zhang, and T. Li. 2020. A Survey on End-Edge-Cloud Orchestrated Network Computing Paradigms: Transparent Computing, Mobile Edge Computing, Fog Computing, and Cloudlet. Computing Surveys 52, 6 (Nov. 2020), 1–36.
[97]
Alessandro Rivitti, Roberto Bifulco, Angelo Tulumello, Marco Bonola, and Salvatore Pontarelli. 2023. eHDL: Turning eBPF/XDP Programs into Hardware Designs for the NIC. In the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Vol. 3, 208–223.
[98]
Martin Roesch. 1999. Snort – Lightweight Intrusion Detection for Networks. LISA ’99: Proceedings of the 13th USENIX Conference on System Administration 1999, 229–238.
[99]
H. Sabireen and V. Neelanarayanan. 2021. A Review on Fog Computing: Architecture, Fog with IoT, Algorithms and Research Challenges. ICT Express 7, 2 (2021), 162–176.
[100]
Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan Ports, and Peter Richtarik. 2021. Scaling Distributed Machine Learning with In-Network Aggregation. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’21), 785–808.
[101]
Mateus Saquetti, Guilherme Bueno, Weverton Cordeiro, and Jose Rodrigo Azambuja. 2020. P4VBox: Enabling P4-Based Switch Virtualization. IEEE Communications Letters 24, 1 (Jan. 2020), 146–149.
[102]
Mateus Saquetti, Raphael M. Brum, Bruno Zatt, Samuel Pagliarini, Weverton Cordeiro, and Jose R. Azambuja. 2021. A Terabit Hybrid FPGA-ASIC Platform for Switch Virtualization. In 2021 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). IEEE, Tampa, FL, 73–78.
[103]
Husain Sharaf, Imtiaz Ahmad, and Tassos Dimitriou. 2022. Extended Berkeley Packet Filter: An Application Perspective. IEEE Access 10 (2022).
[104]
Satnam Singh and David J. Greaves. 2008. Kiwi: Synthesis of FPGA Circuits from Parallel Programs. In 2008 16th International Symposium on Field-Programmable Custom Computing Machines, 3–12.
[105]
Man-Kit Sit, Manuel Bravo, and Zsolt István. 2021. An Experimental Framework for Improving the Performance of BFT Consensus for Future Permissioned Blockchains. In the 15th ACM International Conference on Distributed and Event-Based Systems (DEBS), 55–65.
[106]
Anirudh Sivaraman, Suvinay Subramanian, Mohammad Alizadeh, Sharad Chole, Shang-Tse Chuang, Anurag Agrawal, Hari Balakrishnan, Tom Edsall, Sachin Katti, and Nick McKeown. 2016. Programmable Packet Scheduling at Line Rate. In the 2016 ACM SIGCOMM Conference. ACM, Florianopolis, Brazil, 44–57.
[107]
Haoyu Song. 2013. Protocol-Oblivious Forwarding: Unleash the Power of SDN through a Future-Proof Forwarding Plane. In the 2nd ACM SIGCOMM Workshop on Hot Topics in Software Defined Networking - HotSDN ’13.
[108]
Brent E. Stephens, Darius Grassi, Hamidreza Almasi, Tao Ji, Balajee Vamanan, and Aditya Akella. 2021. TCP Is Harmful to In-Network Computing: Designing a Message Transport Protocol (MTP). In the 20th ACM Workshop on Hot Topics in Networks (HotNets), 61–68.
[109]
Radostin Stoyanov and Noa Zilberman. 2020. MTPSA: Multi-Tenant Programmable Switches. In the 3rd P4 Workshop in Europe. ACM, Barcelona, Spain, 43–48.
[110]
Henning Stubbe. 2017. P4 Compiler & Interpreter: A Survey. IITM 47 (2017), 47–52.
[111]
Nik Sultana, S Galea, David J. Greaves, and Marcin Wojcik. 2017. Emu: Rapid Prototyping of Networking Services. In the 2017 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC ’17), 459–471.
[112]
Naru Sundar, Brad Burres, Yadong Li, Dave Minturn, Brian Johnson, and Nupur Jain. 2023. 9.4 An In-depth Look at the Intel IPU E2000. In 2023 IEEE International Solid-State Circuits Conference (IEEE ISSCC), 162–164.
[113]
Tushar Swamy, Alexander Rucker, Muhammad Shahbaz, Ishan Gaur, and Kunle Olukotun. 2022. Taurus: A Data Plane Architecture for Per-Packet ML. In the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1099–1114.
[114]
Tushar Swamy, Annus Zulfiqar, Luigi Nardi, Muhammad Shahbaz, and Kunle Olukotun. 2023. Homunculus: Auto-Generating Efficient Data-Plane ML Pipelines for Datacenter Networks. In the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Vol. 3, 329–342.
[115]
Cheng Tan, Nicolas Bohm Agostini, Jeff Zhang, Marco Minutoli, Vito Giovanni Castellana, and Chenhao Xie. 2021. OpenCGRA: Democratizing Coarse-Grained Reconfigurable Arrays. In 2021 IEEE 32nd International Conference on Application-Specific Systems, Architectures and Processors (ASAP), 149–155.
[116]
D. L. Tennenhouse, J. M. Smith, W. D. Sincoskie, D. J. Wetherall, and G. J. Minden. 1997. A Survey of Active Network Research. IEEE Communications Magazine 35, 1 (Jan. 1997), 80–86.
[117]
Yuta Tokusashi, Hiroki Matsutani, and Noa Zilberman. 2018. LaKe: The Power of In-Network Computing. In 2018 International Conference on ReConFigurable Computing and FPGAs (ReConFig). IEEE, Cancun, Mexico, 1–8.
[118]
Shin-Yeh Tsai, Yizhou Shan, and Yiying Zhang. 2020. Disaggregating Persistent Memory and Controlling Them Remotely: An Exploration of Passive Disaggregated Key-Value Stores. In 2020 USENIX Annual Technical Conference (USENIX ATC ’20), 33–48.
[119]
Marcos A. M. Vieira, Matheus S. Castanho, Racyus D. G. Pacífico, Elerson R. S. Santos, and Eduardo P. M. Câmara Júnior. 2021. Fast Packet Processing with eBPF and XDP: Concepts, Code, Challenges, and Applications. Computing Surveys 53, 1 (Jan. 2021), 1–36.
[120]
Péter Vörös, Dániel Horpácsi, Róbert Kitlei, Dániel Leskó, Máté Tejfel, and Sándor Laki. 2018. T4P4S: A Target-Independent Compiler for Protocol-Independent Packet Processors. In 2018 IEEE 19th International Conference on High Performance Switching and Routing (HPSR), 1–8.
[121]
Han Wang, Robert Soulé, Huynh Tu Dang, Ki Suh Lee, Vishal Shrivastav, Nate Foster, and Hakim Weatherspoon. 2017. P4FPGA: A Rapid Prototyping Framework for P4. In the Symposium on SDN Research. ACM, Santa Clara, CA, 122–135.
[122]
Tao Wang, Hang Zhu, Fabian Ruffy, Xin Jin, Anirudh Sivaraman, Dan R. K. Ports, and Aurojit Panda. 2020. Multitenancy for Fast and Programmable Networks in the Cloud. In 12th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud ’20) USENIX Association, Article 10, 10.
[123]
Xiang Wang, Yang Hong, Harry Chang, KyoungSoo Park, Geoff Langdale, Jiayu Hu, and Heqing Zhu. 2019. Hyperscan: A Fast Multi-Pattern Regex Matcher for Modern CPUs. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’19), 631–648.
[124]
Jackson Woodruff, Murali Ramanujam, and Noa Zilberman. 2019. P4DNS: In-Network DNS. In 2019 ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ACM/IEEE ANCS), 1–6.
[125]
Junjie Xie, Deke Guo, Zhiyao Hu, Ting Qu, and Pin Lv. 2015. Control Plane of Software Defined Networks: A Survey. Computer Communications 67 (Aug. 2015), 1–10.
[126]
Junfeng Xie, F. Richard Yu, Tao Huang, Renchao Xie, Jiang Liu, and Chenmeng Wang. 2019. A Survey of Machine Learning Techniques Applied to Software Defined Networking (SDN): Research Issues and Challenges. IEEE Communications Surveys & Tutorials 21, 1 (2019), 393–430.
[127]
Fan Yang, Zhan Wang, Xiaoxiao Ma, Guojun Yuan, and Xuejun An. 2019. Understanding the Performance of In-Network Computing: A Case Study. In 2019 IEEE International Conference on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (IEEE ISPA/BDCloud/SocialCom/SustainCom), 26–35.
[128]
Bo Yi, Xingwei Wang, Keqin Li, Sajal K. Das, and Min Huang. 2018. A Comprehensive Survey of Network Function Virtualization. Computer Networks 133 (Mar. 2018), 212–262.
[129]
Rafael Zamacola, Andres Otero, Alberto Garcia, and Eduardo de la Torre. 2020. An Integrated Approach and Tool Support for the Design of FPGA-Based Multi-Grain Reconfigurable Systems. IEEE Access 8 (2020).
[130]
Rafael Zamacola, Andrés Otero, and Eduardo de la Torre. 2021. Multi-Grain Reconfigurable and Scalable Overlays for Hardware Accelerator Composition. Journal of Systems Architecture 121 (Dec. 2021).
[131]
Rafael Zamacola, Andrés Otero, Alfonso Rodríguez, and Eduardo de la Torre. 2022. Just-in-Time Composition of Reconfigurable Overlays. In the 13th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 11th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM ’22).
[132]
Zhipeng Zhao, Hugo Sadok, Nirav Atre, James C. Hoe, Vyas Sekar, and Justine Sherry. 2020. Achieving 100 Gbps Intrusion Prevention on a Single Server. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’20), 1083–1100.
[133]
Changgang Zheng, Zhaoqi Xiong, Thanh T. Bui, Siim Kaupmees, Riyad Bensoussane, Antoine Bernabeu, Shay Vargaftik, Yaniv Ben-Itzhak, and Noa Zilberman. 2024. IIsy: Hybrid In-Network Classification Using Programmable Switches. IEEE/ACM Transactions on Networking 32, 3 (June 2024), 2555–2570.
[134]
Changgang Zheng, Xinpeng Hong, Damu Ding, Shay Vargaftik, Yaniv Ben-Itzhak, and Noa Zilberman. 2023. In-Network Machine Learning Using Programmable Network Devices: A Survey. IEEE Communications Surveys & Tutorials 26, 2 (2023), 1171–1200.
[135]
Xiangfeng Zhu, Weixin Deng, Banruo Liu, Jingrong Chen, Yongji Wu, Thomas Anderson, Arvind Krishnamurthy, Ratul Mahajan, and Danyang Zhuo. 2023. Application Defined Networks. In the 22nd ACM Workshop on Hot Topics in Networks (HotNets), 87–94.
[136]
Noa Zilberman, Yury Audzevich, G. Adam Covington, and Andrew W. Moore. 2014. NetFPGA SUME: Toward 100 Gbps as Research Commodity. IEEE Micro 34, 5 (Sept. 2014), 32–41.
[137]
Noa Zilberman, Philip M. Watts, Charalampos Rotsos, and Andrew W. Moore. 2015. Reconfigurable Network Systems and Software-Defined Networking. Proceedings of the IEEE 103, 7 (Jul. 2015), 1102–1124.

Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 18, Issue 1, March 2025, 319 pages. EISSN: 1936-7414. DOI: 10.1145/3703028. Editor: Deming Chen.
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 25 December 2024
Online AM: 10 October 2024
Accepted: 14 September 2024
Revised: 16 July 2024
Received: 20 December 2023
Published in TRETS Volume 18, Issue 1

Author Tags

1. In-Network Computing
2. Software-Defined Networking
3. FPGA
4. CGRA
5. SoC

Funding Sources

• Joint project 6G-life
• German Research Foundation (DFG, Deutsche Forschungsgemeinschaft)