
A Survey on Architectures, Hardware Acceleration and Challenges for In-Network Computing

Published: 25 December 2024

Abstract

By moving data and computation away from the end user to more powerful servers in the cloud or to cloudlets at the edge, end user devices only need to compute locally for small amounts of data and when low latency is required. However, with the advent of 6G and Internet-of-Everything, the demand for more powerful networks continues to grow. The introduction of Software-Defined Networking and Network Function Virtualization has allowed us to rethink networks and use them for more than just routing data to servers. In addition, the use of more powerful network devices is bringing new life to the concept of active networks in the form of in-network computing. In-Network Computing provides the ability to move applications into the network and process data on programmable network devices as they are transmitted. In this work, we provide an overview of in-network computing and its enabling technologies. We take a look at the programmability and different hardware architectures for SmartNICs and switches, focusing primarily on accelerators such as FPGAs. We discuss the state of the art and challenges in this area, and look at CGRAs, a class of hardware accelerators that have not been widely discussed in this context.

1 Introduction

With the introduction of 5G and the advancement of Internet of Things (IoT) technology, the demands on network traffic and network processing are rapidly increasing [93]. With the focus now on the development of 6G and the Internet of Everything (IoE), this trend will accelerate even further [73, 91]. In addition to the need for high throughput and low latency, there is a new demand for energy-efficient computing that requires better utilization of the available network resources. However, legacy networks are transmission pipelines, primarily used to move data from typically low-performance data sources to centralized cloud servers where the data are processed. Rather than increasing the computing power of the data source itself, which is especially limited for mobile devices, cloud computing offers the benefits of greater scalability, more available resources, and better cost efficiency.
To address this problem, the concepts of Multi-Access (originally Mobile) Edge Computing [93], fog computing [99], and cloudlets have been introduced. Cloudlets are dedicated and computationally weaker (compared to the cloud) servers placed in close proximity to user devices [89, 96]. However, even though edge computing reduces latency and network load, the problem of insufficient network utilization remains. With the introduction of Software-Defined Networking (SDN) [89] and Network Function Virtualization (NFV) [128], as well as network devices that can do more than data forwarding or simple network-specific transactions such as Network Address Translation (NAT) [56], the research area of In-Network Computing (INC) [77] has emerged. INC moves the computation (of parts) of applications that previously ran on servers into the network. This not only enables more efficient use of available resources, but can also reduce network traffic load, increase throughput, and reduce latency, making INC an important piece of the puzzle on the road to 6G networks. To enable INC, several hardware designs for Smart Network Interface Card (SmartNIC) and programmable switches have been proposed and implemented by the research community as well as by large companies such as Intel, NVIDIA, and AMD/Xilinx. The range of designs explored includes solutions using General Purpose Processors (GPPs), Field Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), or System-on-a-Chip (SoC), as well as lesser-known architectures such as Coarse-Grained Reconfigurable Architectures (CGRAs) [83].
The rest of this work is organized as follows: Section 2 presents related work and our motivation. Sections 3 and 4 give a brief introduction to the enabling technologies and to INC itself. Section 5 discusses the taxonomy presented in this work. Following the taxonomy, the subsequent sections discuss the types of network devices for INC (Section 6), the programmability of these devices (Section 7), and existing architectures (Section 8). Section 9 presents research on proposed hardware designs, including extended Berkeley Packet Filter (eBPF)/eXpress Data Path (XDP) [119] program offloading and optimized designs for specific domains. Section 10 discusses ongoing research challenges, and Section 11 concludes this work.

2 Related Work and Motivation

Comprehensive explorations of SDN and INC for applications and domains such as In-Band Network Telemetry (INT), caching, Machine Learning (ML), congestion control, Consensus Protocols (CPs), security, and resilience/robustness from different perspectives have been done by [67, 75, 77, 89], among others.
Kfoury et al. [75] and Hauser et al. [67] focus on the Programming Protocol-Independent Packet Processor (P4) Programmable Data Plane (PDP). Kfoury et al. [75] compare P4 programmable networks with non-P4 programmable SDNs, while Hauser et al. [67] provide an overview and tutorial on different targets/architectures (ASIC, General Purpose (GP)/Software, Network Processing Unit (NPU), and FPGA), compilers, and APIs.
Michel et al. [89] place a stronger emphasis on network-related applications for SDN. Besides applications, they categorize the PDP with respect to architectures, abstractions, algorithms, and languages, using a categorization of architectures similar to that in [67].
Kianpisheh and Taleb [77] introduced a classification for INC applications, dividing them into (general) PDP INC applications and technology-specific INC applications. The former includes In-Network Analytics (data aggregation, ML, etc.), In-Network Caching, In-Network Security, and In-Network Coordination, while the latter focuses on infrastructure-related applications such as cloud/edge computing, 4G/5G/6G, and NFV.
Other studies address hybrid network architectures that include non-SDN and SDN-capable devices [76], network virtualization [35, 64], the distributed control plane [38, 50, 71], or focus on specific domains such as ML [126, 134].
In our research, similar to [67, 75, 77, 89], we also examine works that focus on offloading applications to the network. In contrast to previous work, however, we propose a different classification for programmable network devices. Rather than categorizing them as SmartNICs, FPGAs, and switches, we emphasize that their architectural characteristics and data processing capabilities are independent of the network device type. For instance, an FPGA can be used to process data on both a Network Interface Card (NIC) and a switch. Our investigation focuses mainly on recent work that has investigated or proposed architectural designs, accelerators, and frameworks for INC, primarily prototyped on FPGAs. We discuss FPGAs as an interesting choice for processing data in the network because they can provide flexibility, acceleration, and power efficiency, in contrast to ASIC-based designs such as the Tofino that lack flexibility, and GPP-based designs that lack power efficiency and acceleration capabilities. But we will also look at the problems of working with FPGAs, how coarse-grained granularity could complement them, and why CGRAs could be a valuable addition for data processing in dynamic network environments.

3 Enabling Technologies

In legacy networks, the decision on how data are forwarded, as well as the functionality of the device, was tightly coupled to the hardware. With the advent of SDN and NFV, hardware-defined functionality and packet forwarding policies have been replaced by a more software-defined approach. SDN and NFV are the two most important technologies for modern networks, enabling programmable and more flexible networks. In this section, we give a brief introduction to these two technologies.

3.1 SDN

Two important questions arise when transmitting network packets. The first is how the packets are transmitted from one device to another, and the second is how the packets are handled by each device (how they move through a device). To abstract the problem, the functionality of network devices can be logically divided into control plane and data plane [89].
The control plane is responsible for handling packet traffic transmitted via or generated by the network device. Its packet processing policies include functions such as updating headers, forwarding packets, and device management, including the allocation and assignment of resources for the data plane. Since the control plane needs to make more complex decisions, it is usually implemented in software running on GPPs [137] rather than on specialized hardware such as an ASIC. However, since software processing on a GPP is quite slow compared to a hardware implementation, the control plane is often referred to as the slow path.
The data plane of a device, on the other hand, is responsible for processing and forwarding packets according to the policies defined by the control plane. To achieve high throughput and low latency, path switching/routing is usually implemented on an ASIC. However, for Virtual Switches (vSwitches), which are used for communication with Virtual Machines (VMs) and thus enable shared use of the physical infrastructure, the data plane can also be implemented in software. Although the data plane of a vSwitch is slower than that of a physical switch, it is still orders of magnitude faster than the control plane of a device.
In legacy networks and network devices, the control and data plane were tightly coupled and therefore lacked flexibility and extensibility. The introduction of SDN decouples the control and data plane and creates a (locally) centralized control plane in the form of a network controller. The network controller enables functions such as global policy management, event-based triggering, centralization of network information, and other global network management and monitoring applications. According to the Open Networking Foundation, the SDN architecture consists of three different layers that are connected to each other via interfaces [45] (Figure 1):
Fig. 1. Overview of the SDN concept. In SDN, the control and data planes are decoupled and communicate over the SBI. Applications can communicate with the control plane through the NBI. Multiple SDN controllers of the distributed control plane can communicate with each other via the EWI.
Application Layer: All SDN applications are located at this layer. The applications communicate with the control plane via the Northbound Interface (NBI).
Control Layer/Control Plane: Logically centralized layer that monitors the network, translates the requirements of the applications, and provides information for the applications via the NBI and controls the network infrastructure via the Southbound Interface (SBI).
Infrastructure Layer/Data Plane: Consists of the Network Elements and devices, responsible for the packet switching and data forwarding. This layer is exposed to the control layer through the SBI.
Because SDN pursues a centralized view of the network in the control plane, the network controller was originally realized on a single machine. While this goal is easily achieved for small networks, for larger networks such as data centers, or for IoT-related networks with a large number of diverse actors, each with different demands, a single centralized network controller quickly becomes a bottleneck. Therefore, several approaches deploy multiple distributed controllers in a network and create the global view by synchronizing them with each other [125]. This physically distributed but logically centralized control plane can be realized in different ways. The simplest approach is to synchronize between all controllers, but it does not scale well. Approaches like Kandoo therefore take a hierarchical approach: all small transactions are handled individually by each local controller, while larger transactions are controlled by a single network controller that has more computational capacity than the local controllers. SDN not only provides the capabilities to address high-bandwidth demands by monitoring and controlling network traffic, but also allows networks to adapt to changing business needs and reduces the complexity of network management and operations [54]. Researchers and network operators can rethink what networks are, as well as how to design and interact with them.

3.2 NFV

In legacy network architectures, any network function beyond simple forwarding, such as firewalls, load balancing, and NAT, was tightly coupled to dedicated middlebox hardware. As a result, any change in network functionality, such as an update to a new protocol, is expensive and time-consuming [128]. NFV virtualizes these Physical Network Functions (PNFs), allowing them to be deployed on commodity servers as software implementations running on VMs and eliminating the need for (expensive) dedicated middlebox hardware. This allows different network services to run on the same equipment, generally reducing network Capital Expenditure and Operational Expense compared to the traditional PNF approach. NFV is complementary to SDN, but the two are not interdependent: SDN deployments can benefit from NFV and vice versa, but both can be deployed independently [1]. What NFV has in common with SDN is that both technologies aim to decouple control and application software from proprietary hardware to enable a more flexible and dynamic network infrastructure.

4 INC

The idea of programmable networks is not new, and was proposed as early as 1997 in the form of active networks [116]. The idea of active networks is to offload the computation of algorithms to the network devices, such as switches, so that they can perform more sophisticated tasks with the data packets than just forwarding them. Although the idea aroused interest, it was not successful at the time due to a lack of computing power and programming models for such network devices.
With the introduction of SDN and the emergence of high-performance programmable network devices such as SmartNICs and programmable switches, we have come a big step closer to realizing the idea of moving computation to the network. Mature programming languages such as P4 [18] and Protocol-Oblivious Forwarding (POF) [107], which is an extension of OpenFlow, also allow heterogeneous network devices to be programmed without having to write code for each device individually. While P4 and POF share the same goal of bringing programmability and reconfigurability to the network, P4 has become the most popular programming language for data plane programming in industry and academia, in part because it provides a High-Level Intermediate Representation [110].
SDN-enabled devices provide us with the ability to perform computation in the network, but there are still many challenges to overcome to make INC feasible and beneficial. For example, Yang et al. [127] showed that it depends on the application whether INC is beneficial or not. In some cases, existing INC solutions can be counterproductive and even degrade performance.
According to Cooke and Fahmy [51], INC is more effective when the edge (data source, e.g., IoT device, sensor with microcontroller) and intermediate nodes (in-network resource, e.g., cloudlet, switch) have capabilities comparable to the central node (e.g., cloud server). In their case study, they found that as long as the computational capabilities of the intermediate nodes are no less than about one-fifth of those of the central node, they still outperform the central node; as the gap becomes larger, the centralized solution becomes preferable. They argue that this gap is unlikely to be closed with traditional software solutions, and that a more sensible approach would be to use hardware accelerators such as FPGAs at the edge and intermediate nodes, bridging the gap to the powerful but inefficient CPUs at the central node while being more cost and energy efficient.
Hu et al. [69] propose a paradigm that moves away from handling data at the packet level and considers all network nodes, including central nodes, as a unified environment based on network hardware virtualization and containers. This would enable a task to be placed on any node and therefore allow a greater degree of flexibility. In their case study for data aggregation, query, and monitoring, they showed that the ingress traffic to the cloud server and the overall energy consumption can be significantly reduced using INC. However, for queries, INC is only feasible when queries are repeated more frequently and the hit rate increases.

5 Taxonomy

Based on the definitions discussed previously, we propose the following taxonomy for programmable and reconfigurable network devices (Figure 2).
Fig. 2. Overview of the proposed taxonomy for programmable and reconfigurable network devices.
Network Device Type. As in the classic taxonomy, we can categorize network devices based on their location and function in the network. However, the focus of this work is on (programmable/reconfigurable) NICs and switches. NICs are located on the host device, for example, a server or a PC. A switch, on the other hand, is located inside the network and connects the devices in the network with each other. This means that while the NIC generally handles the host-specific traffic from and to the host device itself, the switch has to handle a higher traffic load as well as traffic from a larger number of sources and tenants. This is an important fact to consider when offloading functions for INC in order to use the (specialized) compute engines efficiently.
Programming Language/Model. Over time, various programming models, languages, and frameworks for SDN have emerged, such as Frenetic, NPL, P4, Data Plane Development Kit (DPDK), and eBPF. While some frameworks use a General Purpose Language (GPL) such as C, they have limitations in terms of supported constructs; eBPF, for instance, does not support global variables. Domain Specific Languages (DSLs) such as P4 were developed directly for a specific execution model and generally do not provide all the constructs of a GPL like C.
Architecture. The architectures of programmable/reconfigurable network devices can vary greatly in their implementation details (Section 8). While most of the designs discussed in this work are used for NICs, some of these chips can be, and are, also used for programmable switches, such as the AMD/Pensando Capri and Elba architectures [59], which are used both for NICs (so-called Distributed Service Cards (DSCs)) and for switches (Aruba CX 10000 Series [3]). It should be noted that we categorize the architectures based on the data plane. This means, for instance, that an SoC with an FPGA can also have a multi-core Processing System (PS), but the execution of the data plane takes place primarily on the Programmable Logic (PL). We therefore do not categorize such devices as multi-core, but as Reconfigurable SoC (RSoC).

6 Programmable Switches and SmartNICs

According to the SmartNIC Summit [29], a SmartNIC is a NIC that has its own processing capabilities and can perform tasks independently of the host device. It is an accelerator that improves the performance and/or flexibility of tasks such as packet processing, security, and analytics. A SmartNIC by this definition has no specific architectural characteristics regarding which components process the (offloaded) data. For example, it can be a multi-core or many-core system (with cores optimized for network processing), have homogeneous or heterogeneous Computing Units (CUs), be equipped with an accelerator such as an FPGA, or have accelerators implemented as an ASIC. A similar definition can also be applied to programmable/smart switches. While programmable network devices can be used to offload server tasks, they are not capable of running entire cloud services on their own. However, many of today’s cloud services are composed of more or less loosely coupled microservices that can be offloaded to SDN-enabled network devices. Liu et al. [84] investigated the offloading of microservices onto a Marvell/Cavium SmartNIC to improve energy and cost efficiency in data centers. IPSec,1 BM25, Recommend, and NIDS could benefit from the hardware accelerators available on the SmartNIC, while NATv4, Count, EMA, KVS, Flow Monitor, and DDoS could benefit from the fast memory interconnect. Their work shows that microservices can benefit from dedicated hardware accelerators in addition to fast onboard memory interconnects. However, providing hardware accelerators for a wide range of services would result in a large increase in hardware resources on the network devices. The changing requirements of the accelerators, as well as the fact that some may not be used by the microservices, make such a design disadvantageous and could lead to a large increase in the power requirements of the network devices themselves. Therefore, a possible design could provide the most commonly used hardware accelerators as (programmable) ASICs together with embedded reconfigurable hardware accelerators such as FPGAs that can be reconfigured by the data center service provider on demand, or even online, depending on the service requested.

7 Programmability of Network Devices

In this section, we first give a brief overview of the three most common technologies used for programming network devices. The first is P4, a DSL originally established for programming switches that can also be used for a variety of SmartNICs. The second, DPDK, and the third, XDP, are two different approaches that were proposed to improve the performance of packet processing on a host system but can be offloaded to SmartNICs.
P4 is a DSL for programming network devices. In the first version of the standard, P4\({}_{14}\), P4 described the processing of packets in a Reconfigurable Match Table (RMT) (sometimes also called Match-Action Table (MAT)) pipeline with a programmable parser and deparser, targeting Protocol Independent Switch Architecture (PISA) devices [44]. The main idea behind the introduction of the RMT model was to create a Reduced Instruction Set Computer (RISC)-like programmable pipeline without compromising performance through high processing latency [43]. The RMT pipeline therefore uses Ternary Content Addressable Memory for ternary matches and SRAM for exact matches.
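To make the match-action abstraction concrete, the following minimal C sketch emulates what a single exact-match stage does conceptually: a key extracted from the packet header selects a table entry, and the entry determines the action and its parameters. All names and types are illustrative only; a real RMT stage performs this lookup in TCAM/SRAM hardware within a pipeline cycle rather than in software.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch of one exact-match match-action stage. */
typedef enum { ACT_FORWARD, ACT_DROP } action_t;

struct mat_entry {
    uint32_t key;    /* match key, e.g., destination IPv4 address */
    action_t action; /* action selected on a match                */
    uint16_t port;   /* action parameter: egress port             */
};

/* A tiny exact-match table; hardware would hold this in SRAM. */
static const struct mat_entry table[] = {
    { 0x0A000001u, ACT_FORWARD, 1 }, /* 10.0.0.1 -> port 1 */
    { 0x0A000002u, ACT_FORWARD, 2 }, /* 10.0.0.2 -> port 2 */
};

/* Returns the egress port, or -1 for a drop (default action on miss). */
static int match_action(uint32_t dst_ip)
{
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].key == dst_ip)
            return table[i].action == ACT_FORWARD ? table[i].port : -1;
    return -1; /* table miss: apply default action (drop) */
}

int main(void)
{
    printf("10.0.0.1 -> port %d\n", match_action(0x0A000001u)); /* 1  */
    printf("10.0.0.9 -> port %d\n", match_action(0x0A000009u)); /* -1 */
    return 0;
}
```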
The three main goals of P4 are as follows [44]:
Field Reconfigurability: The network controller should be able to redefine the packet parsing and processing in the field, without changing the hardware.
Protocol Independence: The network device should not be restricted to a (set of) specific packet formats. This means that the network controller should be able to redefine how the parser extracts information from the packet header and how to process it.
Target Independence: The same P4 program should run on different target devices, independent of the underlying architecture.
While P4 gives us new ways to interact with the network, it uses a very restrictive model that does not include typical programming constructs such as loops (only parsers are allowed to contain bounded loops). The advantage is that the computational complexity of a P4 program is linear, making it easy to predict how packets will be processed, and the strict processing structure allows the model to guarantee, in principle, that packets can be processed quickly across a variety of network devices. Some limitations can be partially overcome by recirculating packets after they pass through the egress pipeline, or by resubmitting them from the queue (and/or buffers) after the ingress pipeline [16]. Although P4 is already restrictive, vendors also customize P4 for their architectures, which not only makes P4 even more restrictive than the standard already defines but also undermines the third goal of creating a target-independent programming language for the PDP [65]. With the introduction of the 2016 revision of P4, called P4\({}_{16}\) [18], P4 took the next step and went beyond PISA, targeting routers and NICs in addition to switches. With this came the ability to call external objects and functions provided by the architecture. This is an important step for interacting with custom architectures implemented on, for example, an FPGA.
With the introduction of P4\({}_{16}\), the P4 Language Consortium has also defined the Portable Switch Architecture (PSA) and is working on the definition of the Portable NIC Architecture (PNA). The PSA [19] specifies the structure and common capabilities of network switches (illustration of possible packet flows in Figure 3), and the definition of PNA [17] aims to provide the same concept for NICs. In addition, the PNA describes the interaction between the host system and the NIC. However, the definition of PNA is still under development by the P4 Language Consortium (version 0.5 at the time of writing [17]).
Fig. 3. Overview of the packet flow in the P4 Portable Switch Architecture (PSA) for two pipelines. The Packet Buffer (PB), Packet Buffer and Replication Engine (PRE), and the Buffer Queuing Engine (BQE) are target-dependent blocks. Multiple physical ports can be hardwired to one ingress/egress pipeline.
The DPDK [5] is one of the most widely used frameworks for high-performance packet processing. DPDK bypasses the kernel by moving control of networking from kernel space to the application in user space. This eliminates the overhead of kernel/user space context switching and of the kernel network stack. To avoid starting from scratch, DPDK provides many libraries to speed up implementation. To avoid the overhead of interrupts, DPDK provides a Poll Mode Driver (PMD), which is designed to let a core poll for packets continuously. In addition, DPDK provides drivers for Direct Memory Access (DMA), ML, and crypto devices from a variety of vendors such as NXP, NVIDIA, and Marvell. While DPDK provides fast packet processing, both of its main components (kernel bypass and PMD) have drawbacks: first, the kernel bypass gives up security and isolation mechanisms that have already been solved in the well-maintained Linux kernel, and second, the constant active polling of the PMD keeps the CPU core permanently under full load. Originally developed by Intel, DPDK is now under the umbrella of the Linux Foundation and is primarily supported on Linux. While an official Windows port of DPDK exists, it is still under development at the time of this writing [5].
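To illustrate the PMD polling model, the following minimal C sketch shows the core receive loop of a DPDK application. It is a sketch under simplifying assumptions: port 0 is assumed to be already configured and started (the usual rte_eth_dev_configure()/rte_eth_rx_queue_setup()/rte_eth_dev_start() sequence is omitted), and received packets are simply freed instead of being processed or forwarded.

```c
#include <stdint.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

int main(int argc, char **argv)
{
    /* Initialize the Environment Abstraction Layer (cores, hugepages, ...). */
    if (rte_eal_init(argc, argv) < 0)
        return -1;

    const uint16_t port_id = 0; /* assumption: port 0 is configured and started */
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Poll the NIC; returns immediately even when no packets arrived.
         * This busy loop is why a PMD core permanently runs at full load. */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < nb_rx; i++) {
            /* ... packet processing would happen here ... */
            rte_pktmbuf_free(bufs[i]); /* or hand to rte_eth_tx_burst() */
        }
    }
    /* not reached */
}
```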
One of the first projects to provide a P4 compiler with DPDK integration was T4P4S [120]. The T4P4S compiler generates hardware-agnostic C code from a P4 program. T4P4S first compiles the high-level description in P4 into a target-independent low-level description (called Core) and maps this description to the hardware-dependent functions (memory and queue initialization, persistent state/stateful memory management, I/O management and scheduling, packet manipulation, and metadata initialization/modification) using a Hardware Abstraction Layer (HAL) library (called NetHAL). Note that [120] only considers P4\({}_{14}\); the newer version also supports P4\({}_{16}\) [27].
P4-DPDK [15] provides a DPDK integration into the P4 environment that is directly supported by the P4 community and, with the Infrastructure Programmer Development Kit [8], is an integral part of a project under the umbrella of the Linux Foundation. The P4-DPDK Software Switch (SWX) pipeline provides a SWX implementation of the P4 pipeline that uses the specification file created by the P4Compiler (P4C) DPDK backend to set up the (VM-like) software pipeline. The p4-dpdk-target uses the JSON files created by the P4C DPDK backend to define the front-end interfaces and the mapping of the P4 runtime information to the DPDK target-specific information. However, at the time of writing, only the ingress pipeline implementation is supported in the current release.2 Components such as Packet Buffer and Replication Engine (PRE) or Packet Buffer (PB) (Figure 3) and features such as timestamps or checksums are not supported.
While P4-DPDK and T4P4S share the same goal, their approaches differ: T4P4S lowers P4 to a low-level Intermediate Representation (IR) and maps it to the hardware, hiding the implementation details behind a HAL, whereas P4-DPDK configures a software pipeline directly from the artifacts generated by the P4C DPDK backend.
eBPF [6, 103] is a lightweight RISC-like VM inside the Linux kernel. It allows user-defined low-level programs to run in a sandbox within the kernel, extending the kernel’s capabilities without modifying its source code or loading kernel modules. eBPF provides portability, flexibility, and security. A verification step checks that a program cannot intentionally or unintentionally harm the system, for example by causing crashes, accessing out-of-bounds memory, or containing an infinite loop that prevents the program from terminating. A Just-in-Time (JIT) compiler then translates the bytecode into the machine-specific instruction set. eBPF is well supported and used by a large community as well as by industry, for instance by NVIDIA. eBPF is partially supported by DPDK through a BPF library that allows eBPF bytecode to be executed within the user space of a DPDK application [21]. However, at the time of writing, the current version (23.07.0-rc2) does not support features such as external function calls on 32-bit platforms, and JIT is only supported for 64-bit x86 and ARM processors.
XDP [68] was introduced to provide fast packet processing without losing features such as the security mechanisms, management tools, and isolation provided by the OS. Unlike frameworks such as DPDK that bypass the kernel, XDP works at the lowest level of the Linux network stack, exposing a so-called hook to which an eBPF program can be attached. This allows the eBPF program to make early decisions on incoming packets without the overhead of the entire network stack. In contrast to DPDK, XDP is therefore transparent to the host. However, as the network load increases, so does the CPU usage.
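As a minimal example of this hook mechanism, the following eBPF/XDP program, written in the restricted C dialect accepted by the eBPF toolchain, drops all IPv4/UDP packets before they enter the kernel network stack and passes everything else on. The filtering rule is purely illustrative; note the explicit bounds checks, which the verifier described above requires before it accepts the program.

```c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Illustrative XDP program: drop IPv4/UDP, pass everything else. */
SEC("xdp")
int xdp_drop_udp(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    /* Bounds checks are mandatory: the verifier rejects any program
     * that could read past the end of the packet buffer. */
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    return ip->protocol == IPPROTO_UDP ? XDP_DROP : XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```

Such a program is typically compiled with clang -O2 -target bpf and attached to an interface with a loader such as iproute2 or libxdp.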
Because DPDK constantly polls for packets while XDP uses processor interrupts, DPDK generally has a lower latency than XDP. For low-rate packet processing, the difference can be one to two orders of magnitude on average. However, for high-rate packet processing, the advantage of DPDK shrinks to less than a factor of 2.5 for latency.
This means that DPDK is advantageous for services that provide rapid response where low-latency packet processing is required, such as in real-time human–machine interaction applications. XDP, on the other hand, allows the use of the well-established and well-maintained features of the Linux kernel, reducing implementation and maintenance costs.
Like DPDK, P4 provides a compiler backend for eBPF programs [20]. However, at the time of writing, the backend only supports packet filtering. P4 features like multicast, ternary table matches, or parsers containing cycles are not yet supported.
The AMD/Xilinx Nanotube compiler and framework [10] allows eBPF/XDP code to be compiled into a processing pipeline in High-Level Synthesis (HLS) C++ that can be synthesized with Vitis HLS.
For the interested reader, a more detailed overview of eBPF and XDP can be found in [119].

8 Architectures

In this section, we discuss different architectural approaches for SDN devices that have been proposed over the years. While the proposed designs target the same domain, they differ significantly in their design. Some have limited programmability, while others offer GP cores. Many designs that use GP cores are marketed as Data Processing Units (DPUs), and authors such as [39] distinguish between SmartNICs and DPUs. The distinction made by [39] is based on the fact that, unlike previous SmartNICs such as the ConnectX cards or SmartNICs equipped with Netronome’s NPUs, the BlueField DPU is able to run control plane applications in addition to data plane processing. However, the control plane can be logically centralized yet physically distributed, as discussed in Section 3, and it is not specified that the controller must be located on a server. This means that multiple physically distributed single-device control planes conform to the SDN paradigm and can be realized on any capable NIC or switch, which also conforms to the definition presented in Section 6. Therefore, we would not treat DPUs as a separate category of network devices, but rather categorize them as a possible chip design on a SmartNIC (Figure 4 provides an overview).
Fig. 4. An overview of the different types of SmartNICs. Chip (1) is used for data plane offloading, but may also include GPP for control plane tasks. Chip (2) is optional and can be either an accelerator (e.g., GPU on NVIDIA A30X/A100X) or a CPU for control plane tasks. In the latter case, the optional management port (ETH MNGT) is connected directly to (2). On reconfigurable hardware, some components such as the MAC can be hardened as ASIC or soft-implemented on the FPGA fabric. Some many-core designs also have a dedicated cluster connected directly to the MACs. Similar designs can be found for switches. MAC, Media Access Control.
In the following, we will first discuss the use of GPPs (Section 8.1) for programmable network devices and then move on to programmable ASICs (Section 8.2), followed by FPGA (Section 8.3) and SoC (Section 8.4) designs. We conclude this section with a discussion of CGRAs and JIT compilation (Section 8.5), which, to the best of our knowledge, have not yet been discussed in the context of network architectures for INC.

8.1 GPPs

As the GPP architectures used in commodity servers offer a high degree of flexibility when processing data packets, they are widely used in data centers. However, their high flexibility comes at the price of low energy efficiency and latency that is difficult or impossible to predict. On their own, they are generally not able to provide the performance required for inline packet processing. Therefore, they are typically not used as standalone processors for the data plane, but for control plane processing and slow path offloading instead. They are either connected as standalone processors to an accelerator via a (PCIe) interface or integrated as a component/subsystem of an SoC. An example of a standalone processor combined with an accelerator is the Cisco 3550-T. The Cisco 3550-T switch has all of its 48 25G SFP28 ports directly connected to an AMD/Xilinx Virtex UltraScale Plus FPGA with 8 GB of High-Bandwidth Memory (HBM2) for data plane processing. The FPGA is connected via a PCIe Gen3 x8 interface to an Intel Atom processor with 8 cores and 16 GB DDR4 for control plane processing.
While the advantage of such a design is that all components are readily available and no custom chip design is required, the disadvantage is that the data transfer between accelerator and processor is limited by the interface (e.g., about 63 Gb/s of data transfer without protocol overhead for PCIe Gen3 x8) and each transfer incurs a communication overhead that increases latency. Therefore, GPPs (with the exception of slow-path offloading) are not used for data plane processing on these devices. The trend is to use GP cores tightly coupled with dedicated (network) accelerators on a single SoC. Typical processor cores for this type of device are ARM- or MIPS-based. Further details on such designs are discussed in Section 8.4.
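The quoted figure follows directly from the link parameters: PCIe Gen3 signals 8 GT/s per lane and uses 128b/130b encoding, so eight lanes yield \(8\,\textrm{GT/s} \times 8 \times \frac{128}{130} \approx 63\,\textrm{Gb/s}\) of raw bandwidth, from which the transaction-layer protocol overhead must still be subtracted.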

8.2 ASICs

An ASIC is a chip that is specialized and optimized for a specific task. Hardware developers implement only the minimum set of operations required to accomplish the task at hand. ASIC-based designs have the advantage of performing their task with an energy-to-performance ratio that is orders of magnitude better than that of a GPP [83], but they lack any flexibility. Since they also cannot be updated or fixed once they are manufactured and shipped, they have long and costly development cycles to ensure the correctness of the design.
Before the introduction of SDN and NFV, ASIC-based network devices with functionality tightly coupled to hardware were the state of the art, not only for power but also for performance reasons. Earlier GPPs were simply not capable of processing packets at line speed, and in many cases they still are not. As a result, many ASIC-based network chips use GPPs only for control plane functionality or to handle unusual/special cases of data plane packet processing (slow path offloading), while fast packet processing on the data plane is still implemented with ASICs. However, some ASIC-based network chips, such as the Intel Tofino (formerly Barefoot Tofino) [34], do offer some flexibility. These so-called programmable ASICs provide a basic level of flexibility by using MAT pipelines inspired by the RISC architecture. In addition, such designs are often equipped with programmable parsers and deparsers to enable protocol-independent packet processing according to the PISA architecture. This allows network operators to define and use their own protocols, or to switch to a different protocol if required, without having to replace expensive hardware, unlike with fixed-function network chips.

8.3 FPGAs

FPGAs are reconfigurable devices consisting of Configurable Logic Blocks connected by a Switch Matrix. In general, any digital circuit can be designed on an FPGA, as long as enough resources are available. The downside of this flexibility is that FPGAs are larger and the achievable frequency is typically lower compared to ASICs. To at least partially address these issues, FPGAs typically also contain slices of specialized components such as Digital Signal Processing units and Block-RAM (BRAM).
Before P4, Xilinx had already targeted network programmability for their FPGAs. They proposed their own packet processing language called PX [24] for SDNet. SDNet allows the import of custom IP cores implemented in VHDL and Verilog by defining the interface to the user engine [23]. In addition, a C behavioral description must be provided for simulation. However, with the increasing popularity of P4, a \(P4_{16}\) compiler was introduced into the SDNet environment. The Xilinx adaptation of the P4 compiler generates a JSON file for the runtime control software and the PX program, which is compiled as before by the SDNet compiler into a Hardware Description Language (HDL) module [70]. In addition, a SystemVerilog testbench is generated. With the P4 \(\rightarrow\) NetFPGA design flow, the P4-SDNet compiler was integrated by the NetFPGA project [70] to provide easier programmability of their NetFPGA-SUME boards without the need for HDL knowledge. NetFPGA PLUS [7] introduced a codebase using parts of the AMD/Xilinx OpenNIC project [2] to support AMD/Xilinx Alveo 200, 250, and 280 boards. With the introduction of Vitis Networking P4 (VNP4) [53], AMD/Xilinx provides direct P4 integration, generating SystemVerilog files for simulation and synthesis from a P4 description using the P4C-vitisnet compiler.
Since P4 is not a Turing-complete language, many constructs required for FPGA data plane programming cannot be expressed in it. However, since the introduction of extern in \(P4_{16}\), it is possible to interact with objects and functions provided by the architecture. This means it is possible to configure the FPGA and connect functions exposed to the data plane. Nevertheless, many features of modern FPGAs, such as Dynamic Partial Reconfiguration (DPR), cannot yet be used. The still-incomplete P4 standard for SmartNICs (the PNA, Section 7) is another open problem related to portability. The OpenNIC project [2] addresses these problems only partially: first, as the authors state, OpenNIC is not a full-featured SmartNIC solution, and second, the project originated as a research project within Xilinx Labs and targets only Xilinx FPGAs.
Research on providing programmability for the data plane has made significant progress. However, many features are still missing or proprietary. Moreover, programming FPGAs for network processing remains challenging and sometimes still requires a lot of manual work and expert knowledge. For example, while it is possible to use HLS with OpenNIC, the RTL wrapper still has to be created manually because the interfaces generated by HLS are not directly compatible with the interfaces expected by OpenNIC.

8.4 SoC

SoC is a collective term for any design that realizes a system of multiple heterogeneous components interconnected on a single chip to solve specific tasks. Over time, SoC architectures have been proposed and realized for many domains, such as mobile processors with integrated graphics units, integrated tensor units for ML, or NPUs for network processing. In this section, we discuss several SoC solutions used for network processing. Since in our opinion the term SoC is too broad to describe the different architectural approaches, we use the following categorization for SoCs on programmable network devices:
(1)
RSoC. An SoC that combines a PS consisting of GP cores with the PL of reconfigurable hardware such as an FPGA or CGRA. The cores of the PS are primarily used for control plane tasks.
(2)
Multi-Core SoC (MuSoC). An SoC that uses multiple high-speed GP cores as the main component. The cores are used for both data plane and control plane processing.
(3)
Many-Core SoC (MaSoC). An SoC with a large number of programmable but simple dedicated cores for data plane processing to achieve a high degree of parallelism. They may also be equipped with a single or small number of GP cores used primarily for control plane tasks.
All three SoC categories can be equipped with hardened accelerators, e.g., for cryptography.
The industry often uses other terms such as NPU, DPU, and Infrastructure Processing Unit (IPU). According to the SmartNIC Summit definition [29] and similar definitions from NVIDIA [31], a DPU should have the following characteristics:
Industry-standard, high-performance, software-programmable (multi-core) CPU.
High-performance network interfaces capable of parsing, processing, and transferring data at line-rate.
Flexible and programmable acceleration engines for applications such as ML, security, telecommunications, and storage.
Since this definition fits the MuSoC category well, most industry-proposed DPUs can be placed in it. However, other proposals, such as the DPU on the AMD/Pensando DSC, fit better into the MaSoC category in our opinion.
According to the definitions provided by the SmartNIC Summit [29] and Intel, we see a strong overlap between DPU and IPU. The only difference is that the DPU definition emphasizes that industry-standard CPUs are used, while the IPU definition focuses more on the accelerators being hardened and tightly coupled to the dedicated programmable cores. Looking at the devices offered by Intel under the umbrella of this term, we can see that it was mainly defined in the context of the Mount Evans architecture [112]. However, other products offered by Intel under this term do not fit this definition: Intel also uses it for the F2000X-PL (Intel Agilex 7 F) and C5000X-PL (Stratix 10 DX) boards. Both boards are SmartNICs with an Intel Xeon-D connected via an on-board PCIe interface to an FPGA-based SoC with a 64-bit quad-core Arm Cortex-A53, which contradicts the definition of an IPU requiring the accelerators to be hardened. Table 1 shows the classification of some upcoming and already available SoCs based on our categorization.
| Class | Manufacturer | Chip(-Family) | Language/Framework | Architecture |
|---|---|---|---|---|
| RSoC | AMD/Xilinx | Alveo U25N | Vitis (HLS, Verilog, VHDL, P4) | XCU25N SoC with FPGA fabric and 4x Arm Cortex-A53 |
| RSoC | Intel | N6000-PL (N6010/6011) | DPDK, FlexRAN (vRAN only), OPAE, OFS, Intel Quartus Prime Pro Edition | AGF014 SoC with FPGA fabric and 4x Arm Cortex-A53 |
| MuSoC | Microsoft/Fungible | S1 | eBPF, C | 16x MIPS64 R6 cores |
| MuSoC | Marvell/Cavium | OCTEON 10 | P4, eBPF, DPDK | 8–24x Arm Neoverse N2 |
| MuSoC | NVIDIA/Mellanox | BlueField-2 | DOCA (SPDK, DPDK, P4, Netlink) | 8x Arm Cortex-A72 |
| MuSoC | NVIDIA/Mellanox | BlueField-3 | DOCA (SPDK, DPDK, P4, Netlink) | 8x or 16x Arm Cortex-A78 |
| MaSoC | AMD/Pensando | Capri | P4, DPDK | 4x Arm Cortex-A72, 112 MPUs (a) |
| MaSoC | AMD/Pensando | Elba | P4, DPDK | 16x Arm Cortex-A72, 144 MPUs |
| MaSoC | Netronome | NFP-4000 | P4, C, DPDK | Arm11 core \(+\) 4 FPCs; 48 PPCs (in- and egress); 5x 12 FPCs (b) |
| MaSoC | Netronome | NFP-6000 | P4, C, DPDK | Arm11 core \(+\) 4 FPCs; 96 PPCs (in- and egress); 10x 12 FPCs |
| MaSoC | Marvell/Cavium | OCTEON III (CN7890) | P4, eBPF, DPDK | 2x MIPS64 R5 \(+\) 48x cnMIPS64 v3 |
| MaSoC | Microsoft/Fungible | F1 | eBPF, C | 4x MIPS64 (2x SMT) \(+\) 8x 6 MIPS64 (4x SMT) |
Table 1. Examples of SoC Classified Based on the Proposed Categorization
(a) On both Capri and Elba, the Media Access Control (MAC) units are connected to a central packet buffer. The Match Processing Units (MPUs) are organized into four pipelines. The ingress and egress pipelines are connected to the central packet buffer. In addition to the ingress and egress pipelines, a pipeline from the host direction (TxDMA) and a pipeline to the host direction (RxDMA) are provided [60]. Both are conceptually similar to the ingress and egress pipelines, with DMA engines and a scheduler (TxDMA) instead of parser/deparser.
(b) Netronome’s Network Flow Processor (NFP) is based on Intel’s IXP28xx Network Processor (NP) architecture [55]. The main components of the SoC are Micro-Engines (MEs), which are software-programmable multi-threaded 32-bit RISC cores. The MEs are divided into two classes, Flow Processing Cores (FPCs) and Packet Processing Cores (PPCs), grouped into islands [11]. PPC islands exist in both the ingress and egress direction and are directly connected to the MAC units. Packets are processed by one or more PPC groups in a pipeline [11].

8.5 CGRA

To the best of our knowledge, CGRAs in the context of programmable network devices have not yet been realized as a dedicated chip design and are not widely discussed in this context. The only exception is [113], which proposed a CGRA-based prototype on an FPGA for the data plane of programmable switches (details are discussed in Section 9.3). Nevertheless, we want to discuss this architecture, which has gained recognition as an alternative reconfigurable-hardware design to FPGAs in some domains, such as ML.
Research on CGRAs has been ongoing for about 30 years, dating back to the early 1990s [48, 66]. Unlike the fine-grained architecture of FPGAs, CGRAs are, as the name implies, a coarse-grained architecture. The higher granularity allows the construction of optimized Processing Elements (PEs) that require less area and power and can operate at a higher clock frequency. These PEs (also called Reconfigurable Cells or Functional Units (FUs) depending on the literature) are the smallest reconfigurable unit of a CGRA and are interconnected by programmable switches.
The higher degree of granularity allows reconfiguration times in the range of \(ns\) to \({\mu}\textrm{s}\), compared to FPGAs with reconfiguration times in the range of \(ms\) to \(s\). Therefore, temporal computation can be better exploited with CGRAs than with FPGAs [83]. However, the higher degree of granularity is accompanied by a loss of flexibility. In terms of their execution model, both CGRAs and FPGAs are data-flow- and configuration-driven architectures. Because CGRAs are less flexible, they are also classified as Domain Specific Accelerators in [83], while FPGAs provide general flexibility. Even though CGRAs seem to be a good compromise between ASICs, with fixed functionality and little to no flexibility, and FPGAs, with a high degree of flexibility, they are still an immature architecture that has to deal with a variety of problems. Most importantly, there are no high-level compilers that are well optimized for this architecture; the gap between manually optimized and compiler-optimized code is large. Liu et al. [83] suggest that it would be more efficient to use a lower-level programming model that exposes more architectural details to the programmer rather than relying on compiler optimization. However, this approach is not practical for many real-world (commercial) scenarios where time-to-market and low-cost development are more important than achieving the highest possible performance for the architecture. Therefore, we believe it is more important to develop compilers that are able to leverage the architectural features of CGRAs, for example, based on the Multi-Level Intermediate Representation (MLIR) compiler infrastructure [80], and to provide libraries with optimized components, as also discussed in [115].
Another approach that takes advantage of the fast reconfigurability of CGRAs is to use CGRA-like overlays for FPGAs. Jain et al. [72] proposed a coarse-grained overlay for JIT compilation on FPGAs, using the Clang compiler to transform OpenCL kernels into LLVM IR from which the Data Flow Graph (DFG) is extracted. The nodes of the DFG are then mapped to the FUs of the overlay, which is used for FU netlist generation and Place and Route (PnR) on the overlay. Their proposed design improved PnR time by orders of magnitude, from several hundred seconds when targeting the FPGA fabric to a fraction of a second when targeting the overlay on a workstation. Even performing PnR on the ARM Cortex-A9 of the Zynq 7020 SoC took less than a second in most cases.
Zamacola et al. [130, 131] proposed a multi-grain overlay for FPGAs based on the CGRA-ME framework [49]. Their proposal builds on their previous work called IMPRESS [129], which supports multi-grain granularity for Xilinx 7 series FPGAs using DPR. In [130], they extend their work by integrating a modified version of CGRA-ME for their mapping tool. Like [72], they use the Clang compiler to transform the code into an LLVM IR to extract the DFG, which is then mapped onto the overlay. However, while [72] uses only a coarse-grained overlay, [130] supports two levels of granularity: a fine-grained level for configuring the FUs and a medium-grained overlay for assembling (or stitching, as they call it) the FUs.
As discussed in Section 4, intermediate nodes have limited computational capabilities. Therefore, they require hardware accelerators to provide the performance needed to offload applications to the network. While programmable ASICs like the Tofino can provide a high degree of performance, they lack the flexibility to meet the needs of many applications. Building a custom ASIC is time-consuming and costly and can only provide a limited number of accelerators that must be clearly defined in advance, which is not feasible in most real-world scenarios. While highly flexible reconfigurability is one of the main strengths of FPGAs when it comes to creating a custom design without going directly to an ASIC, it comes (in addition to long design times) with the disadvantage of a high reconfiguration time (\(ms\) to \(s\)). CGRAs, on the other hand, have a reconfiguration time that is orders of magnitude lower (\(ns\) to \({\mu}\textrm{s}\)) than that of FPGAs [83].
In order to provide acceleration for a wide range of applications in a dynamic network environment, we believe that fast accelerator exchange will be mandatory for future networks utilizing INC. Therefore, we believe that more research in the direction of implementing CGRAs as an actual integrated architecture for network devices, as well as a coarse-grained or multi-level grained overlay for FPGA-based network devices using technologies such as JITs, could be beneficial for accelerating algorithms in the network.

9 Exploring FPGA-Based Designs for the PDP

In this section, we present research focused on offloading applications to FPGA-based network devices. Section 9.1 presents research on SmartNIC and switch hardware designs independent of the application domain, Section 9.2 covers frameworks and compilers for the automatic offloading of eBPF/XDP programs to SmartNICs, and Section 9.3 concludes with research on application offloading to the network for different domains. Table 2 provides a summary of the work presented, and Table 3 summarizes important results.
Table 2.
ScopeProposalYearTarget PlatformDetails
DNSP4DNS [124]2019Switch (FPGA)—P4 implementation of DNS service.
—Control plane communication via DMA.
—Use of P4 \(\rightarrow\) NetFPGA workflow.
—Comparison with Emu based on reported in [111].
—Provides more features than Emu DNS.
CPVariant of PBFT [105]2020SmartNIC and FPGA—Implementation and evaluation of different configurations for PBFT.
—Study for SmartNIC offloading and implementing of PBFT completely on an FPGA.
Caching/KVSLaKe [117]2018Switch/NIC (FPGA)—HW implementation based on the concept of the Memcached system.
—Multilevel cache architecture using on-chip and on-board memory.
—Comparison with Emu based on reported in [111].
—Slightly lower cache-hit latency compared to Emu.
   \(\quad\circ\) Emu does not have a cache hierarchy.
 Emu [111]2017Switch/NIC (CPU, FPGA)—C# library for Kiwi compiler.
—C# code is transformed to Verilog code by Kiwi.
—Additional comparison with NetFGPA and P4FPGA [121].
 PANIC [81]2020SmartNIC (ASIC/FPGA)—Heterogeneous CUs (accelerator or processor)
   \(\quad\rightarrow\) Support of hardware acceleration and software offloads
—CUs connected over a crossbar to a central hardware packet scheduler using a PIFO.
—Uses Corundum’s [58] NIC driver, DMA engine, MAC, and PHY.
—FPGA prototype and ASIC implementation analysis.
General/Multi-DomainSuperNIC [82]2024SmartNIC (FPGA)—Mapping of task DAG to reconfigurable regions on FPGA.
—Reconfigurable regions/CUs are connected over a crossbar to a central scheduler.
—CU can contain a chain of multiple tasks. Wrapper around each task for bypassing.
—Subgraphs (virtual task chain) extracted from a tenant’s DAG are scheduled to tasks in CU.
—Supports tenant CU sharing, replication, and virtual task chain parallelism.
—Fairness mechanism based on space sharing using DRF approach [62] and time-sharing.
 MTPSA [109]2020Switch (CPU, FPGA)—Proposals of security isolation mechanisms for the PSA architecture.
—Separation of superuser- and user pipelines. Concept of roles taken from OSs.
—Proposal to encapsulate user pipelines in superuser egress pipeline (between parser and MAT).
—Encapsulated packet decapsulated by superuser- and processed by user pipeline.
—User program opaque to superuser and other user (P4) programs.
 Terabit Switch Virtualization [102]2021Switch (ASIC-FPGA)—Proposed single SoC solution with network switching logic as ASIC and (embedded) FPGA logic for adaptable packet processing and switch virtualization.
—Based on their P4VBox reference design [101].
—Analysis of ASIC fabric and evaluation of parallel running network application on the FPGA.
 hXDP [46]2020SmartNIC (ASIC/FPGA)—Offloading of XDP programs to a SmartNIC.
—Custom VLIW processor with accelerators.
—Custom compiler to optimize the eBPF byte code for the offload engine.
—Implemented on FPGA, but fixed design \(\rightarrow\) can be implemented as ASIC.
eBPF/XDP OffloadhXDP \(+\) WE [42]2022SmartNIC (ASIC/FPGA)—Extends hXDP with a MAT pipeline (Warp Engine (WE)) in front of the hXDP offload engine.
—Custom compiler in front of the hXDP compiler:
   \(\quad\rightarrow\) Identifies parts to be offloaded to the MAT pipeline.
   \(\quad\rightarrow\) Rest is compiled and executed by the hXDP part.
—Integration of hXDP (\(+\) Warp Engine) in Corundum and ported to Alveo U50.
—Fixed latency of 28 clock cycles (112 ns @250 MHz) for WE.
 eHDL [97]2023SmartNIC (FPGA)—Generation of a hardware pipeline based on the analysis of the eBPF byte code.
   \(\quad\circ\) Compiler translates eBPF byte code to VHDL code.
—Unlike hXDP, resource utilization depends on the application (Figure 5)
 Taurus [113]2022Switch (ASIC, CGRA)—Proposal for CGRA-based switch architecture for ML.
—Prototype using Tofino switch for MAT pipeline and FPGA for CGRA implementation.
—FPGA is connected over Ethernet with the Tofino switch.
—CGRA used for MapReduce \(\rightarrow\) Can be bypassed for normal PISA flow.
—Training on control and inference on data plane (Taurus).
Machine LearningHomunculus [114]2023Switch (ASIC, CGRA)—Framework for mapping ML models to supported switch targets (Tofino, P4-SDNet and Taurus).
—Python Front-end: Provides functions that can be used to integrate existing libraries such as TensorFlow.
—Middle-end: HyperMapper to optimize the configuration file.
—Back-end: Generates code (Spatial [78] and P4).
NetReduce [85], 2023, Switch (ASIC/FPGA)
—Accelerating aggregation in the network for ML.
—Prototype: External FPGA attached to a commodity switch.
—Discusses the implementation of the proposed design as ASIC.
—Discussion of the limitations of (P4) programmable switches.
Security: Pigasus [132], 2020, SmartNIC (FPGA)
—Single-server IDS/IPS.
—Parser, reassembler, and MSPM on FPGA.
—Regular expression and full match stages on host.
Table 2. Overview of the Proposals Discussed, Including Target Platform and Important Details
DNS, Domain Name Server; MSPM, Multi-String Pattern Matcher; MTPSA, Multi-Tenant Portable Switch Architecture; PBFT, Practical Byzantine Fault Tolerance; PIFO, Push-In First-Out; VLIW, Very Long Instruction Word.
Table 3.
ProposalEvaluation SetupRelevant Results
Emu DNS [111]—Host: Intel Xeon E5-2637v4 @3.5 GHz, 64 GB DDR4-RAM
 \(\quad\circ\) NIC: Intel 82599ES (10 GbE)
—NetFPGA-SUME: Emu
—Traffic Source: OSNT [36]
—Throughput (Host): \(\color{green}\uparrow\) \(\approx 5.2x\)
—Latency (Avg., Host): \(\color{green}\downarrow\) \(\approx 1/66.5x\)
—Latency (\(99\)th-perc., Host): \(\color{green}\downarrow\) \(\approx 1/74.4x\)
P4DNS [124]—Host: Intel Xeon E5-2637v4 @3.5 GHz, 64 GB RAM
 \(\quad\circ\) NIC: Solarflare SFC9220 (10 GbE)
 \(\quad\circ\) Software: NSD [12]
—NetFPGA-SUME: P4DNS
—Traffic Source: OSNT [36]
—Throughput (NSD): \(\color{green}\uparrow\) \(\approx 52x\)
—Throughput (Emu): \(\color{green}\uparrow\) \(\approx 10x\)
—Latency (\(99\)th-perc., NSD): \(\color{green}\downarrow\) \(\approx 1/54.2x\)
—Latency (\(99\)th-perc., Emu): \(\color{red}\uparrow\) \(\approx 1.8x\)
—Latency (\(50\)th-perc., NSD): \(\color{green}\downarrow\) \(\approx 1/36.7x\)
Variant of PBFT [105]—Cluster: 24 machines, each Intel Xeon E-2186G
 \(\quad\circ\) 10 GbE network
—Consensus group size: 15 nodes
—Acceleration of data movement and hashing are more beneficial than crypto only acceleration.
—Optimal goodput depends on fine granular batching.
—Area efficiency (Throughput/Area, Intel):
 \(\quad\circ\) \(\color{green}\uparrow\) \(1.5x\) (Packet Filter) - \(248x\) (KV)
Emu Memcached [111]—Host: Intel Xeon E5-2637v4 @3.5 GHz, 64 GB DDR4
 \(\quad\circ\) NIC: Intel 82599ES (10 GbE)
—NetFPGA-SUME: Emu
—Traffic Source: OSNT [36]
—Throughput (Host): \(\color{green}\uparrow\) \(\approx 2.2x\)
—Latency (Avg., Host): \(\color{green}\downarrow\) \(\approx 1/20.1x\)
—Latency (\(99\)th-perc., Host): \(\color{green}\downarrow\) \(\approx 1/22.7x\)
LaKe [117]—Host: Intel Core i7-4770, 64 GB RAM
 \(\quad\circ\) Software: Linux Memcached
—NetFPGA-SUME: LaKe
—Traffic Source: OSNT [36]
—Throughput (Host): \(\color{green}\uparrow\) \(\approx 13.6x\)
—Throughput (Emu): \(\color{green}\uparrow\) \(\approx 6.8x\)
—Latency (Cache Hit, Host): \(\color{green}\downarrow\) \(\approx 1/205x\)
—Latency (Cache Hit, Emu): \(\color{red}\rightarrow\) \(\approx 1x\)
—Latency (Cache Miss, Host): \(\color{green}\downarrow\) \(\approx 1/42x\)
—Latency (Cache Hit vs. Miss, LaKe): \(\color{red}\uparrow\) \(\approx 4.6x\)
PANIC [81]—Server 1: Dell PowerEdge 640
 \(\quad\circ\) NIC: NVIDIA/Mellanox ConnectX-5 (100 GbE)
 \(\quad\circ\) Data Source: DPDK custom packets
—Server 2: Dell PowerEdge 640
 \(\quad\circ\) ADM-PCIE-9V3 (VU3P-2 FPGA): PANIC
 \(\quad\circ\) Use open-source IPs as CUs: RISC-V @250 MHz, AES-256 @250 MHz, SHA-3 @150 MHz
—ConnectX-5 directly connected with PANIC
—Achieved frequency: 250 MHz
 \(\quad\circ\) Exception PIFO: 125 MHz
—11.27% LUT utilization
 \(\quad\circ\) High utilization by PIFO
—8.94% BRAM utilization
—Achieved throughput: 100 Gbps (linerate)
 \(\quad\circ\) Maximum on-chip bandwidth: 256 Gbps
—Latency: Scheduling, load and CU dependent
 \(\quad\circ\) \(\lt 16\,{\mu}\textrm{s}\) in evaluation
SuperNIC [82]—Evaluation using Verilator Simulation (\(s\)) and Testbed (\(t\)).
—Testbed (KVS use case): Cluster connected over a 100 GbE switch
 \(\quad\circ\) SuperNIC: HiTech Global HTG-9200 (AMD/Xilinx VU9P, 9x100 GbE) (\(A^{*}\))
 \(\quad\circ\) 2 Servers: Dell PowerEdge R740 (Xeon Gold 5128) with NVIDIA/Mellanox ConnectX-4 NIC (100 GbE) (\(B\): Clover [118], \(C\): HERD [74]) and NVIDIA/Mellanox BlueField-Gen1 NIC (100 GbE) (\(D\): HERD [74])
 \(\quad\circ\) AMD/Xilinx ZCU106 Evaluation-Board (10 GbE): Clio [63] (\(E\))
—Re-implementation of PANIC [81] to HTG-9200 as baseline.
—Evaluation: Chains with dummies (\(d\)) and chains consisting of tasks (\(r\)), including: Firewall, KV-Cache, NAT, load balancing, forwarding, and AES.
—Parallelism: \(S1\) \(=\) None, \(S2\) \(=\) DAG Parallel, \(S3\) \(=\) \(S2\) \(+\) Instance Parallel
\({}^{*}\) \(A+E\) here with caching NT (also included in the paper without)
—Achieved frequency for most modules: 250 MHz
—Minimum latency (ingress to egress): \(1.3{\mu}\textrm{s}\)
—Beneficial to have short running NTs together in a single CU.
—Time sharing improves area utilization compared to DRF [62] only.
—Achieved throughput: 100 Gbps (linerate)
—Throughput (\(s/d\), \(S1\) vs. \(S2\)): \(\color{red}\rightarrow\) \(\approx 1x\)
—Throughput (\(s/d\), \(S1/S2\) vs. \(S3\)): \(\color{green}\uparrow\) \(\approx 1x-1.5x\)
—Latency (\(s/d\) and \(s/r\), PANIC): \(\color{green}\downarrow\) \(\approx 1x-0.6x\)
—Throughput (\(s/r\), PANIC): \(\color{red}\rightarrow\) \(\approx 1x\)
—\(A+E\) vs. \(E\): Lat. \(\color{green}\downarrow\) \(\approx 0.6x-0.8x\); Thp. \(\color{green}\uparrow\) \(\approx 1.2x-1.8x\)
—\(A+E\) vs. \(B/C\): Lat. \(\color{green}\downarrow\) \(\approx 0.5x-0.6x\); Thp. \(\color{green}\uparrow\) \(\approx 1.2x-1.8x\)
—\(A+E\) vs. \(D\): Lat. \(\color{green}\downarrow\) \(\approx 0.3x-0.4x\); Thp. \(\color{green}\uparrow\) \(\approx 2.9x-3.8x\)
MTPSA [109]—SW-Target: BMv2 (Functional evaluation in Simulation with Mininet.)
—HW-Target: NetFPGA SUME (L2-Switch)
 \(\quad\circ\) Traffic Source: OSNT [36]
\(\quad\circ\) Packet size: 64B–1,518B (Results only for 64B reported).
—Comparison with P4 \(\rightarrow\) NetFPGA PSA as reference design.
—Comparison with MTPSA without (MTPSA\({}_{0}\)) and up to eight user programs (MTPSA\({}_{x},x\in\{2,3,4,8\}\))
—Achieved maximum throughput of the NetFPGA SUME (40 Gbps).
—Latency (Ref. \(\rightarrow\) MTPSA\({}_{0}\)): \(\color{red}\uparrow\) \(1.7{\mu}\textrm{s}\rightarrow 2.52{\mu}\textrm{s}\)
—Latency (MTPSA \({}_{0}\) \(\rightarrow\) MTPSA \({}_{x}\)): \(\color{red}\uparrow\) \(2.52\mu\textrm{s}\rightarrow 3.23{\mu}\textrm{s}-3.3{\mu}\textrm{s}\)
—Relatively stable latency \(\rightarrow\) (Mainly) user program dependent.
—Ref. \(\rightarrow\) MTPSA \({}_{0}\): Logic \(\color{red}\uparrow\) 10%; Memory \(\color{red}\uparrow\) \(7.65\%\)
—MTPSA \({}_{0}\) \(\rightarrow\) MTPSA \({}_{x}\) (Logic, per program): \(\color{red}\uparrow\) \(5.9\%-7.4\%\)
—MTPSA \({}_{0}\) \(\rightarrow\) MTPSA \({}_{x}\) (Memory, per program): \(\color{red}\uparrow\) \(5.4\%-6.3\%\)
Tb Sw. Virtual. [102]—ASIC logic (65 nm): Synopsys Design Compiler
—FPGA logic: AMD/Xilinx XCVU13P
 \(\quad\circ\) Generated vSwitch instances with P4 \(\rightarrow\) NetFPGA.
 \(\quad\circ\) Use case: L2-Switch (26x), Firewall (17x), Router (14x), INT (14x)
—Frequency (ASIC): 1 GHz; Total Area: \(47.6 \mathrm{mm}^{2}\); Power: \(28.3 \mathrm{W}\)
—Frequency (FPGA): 718.4 MHz
—Throughput (per instance): \(129.61-132.63\) Gbps
—Possible throughput (total): \(\approx 1.43-3.45\) Tbps (Max: \(3.2\) Tbps)
hXDP [46]—Host: Intel Xeon E5-1630 v3
—Netronome NFP-4000 @800 MHz: XDP offload
—NetFPGA-SUME: hXDP @156 MHz
—Evaluation applications (Firewall and Katran) and Microbenchmarks.
—Evaluation Packet Forwarding: 64B–1,518B
—\(\approx\) 18% of logic resource utilization
—Throughput (Applications):
 \(\quad\circ\) Host @2.1 GHz: \(\color{green}\uparrow\) \(\approx 1.08x\)\(\approx 1.55x\)
 \(\quad\circ\) Host @3.7 GHz: \(\color{red}\downarrow\) \(\approx 0.88x\)\(\approx 0.62x\)
—Forwarding Latency (64B–1,518B):
 \(\quad\circ\) Host @3.7 GHz: \(\color{green}\downarrow\) \(\approx 1/8.3x\)\(\approx 1/10.7x\)
 \(\quad\circ\) NFP-4000: \(\color{green}\downarrow\) \(\approx 1/1.1x\)\(\approx 1/3.5x\)
—hXDP has higher throughput for TX- or redirection.
hXDP\(+\) WE [42]—AMD/Xilinx Alveo U50 @250 MHz: hXDP
—AMD/Xilinx Alveo U50 @250 MHz: hXDP \(+\) WE
—Logic (hXDP): \(\color{red}\uparrow\) \(\approx\) 51.4% (\(\approx\) 13.3% total)
—Memory (hXDP): \(\color{red}\uparrow\) \(\approx\) 43.5% (\(\approx\) 11.5% total)\(^{a}\)
—Instruction reduction (for hXDP): \(\color{green}\downarrow\) \(\approx\) 16.3% - 100%.
—Throughput (hXDP only): \(\color{green}\uparrow\) \(\approx 1.2x-3.1x\)
 \(\quad\circ\) Complete offload of Suricata to WE: \(\color{green}\uparrow\) \(\approx 18.2x\)
—Latency (hXDP only): \(\color{red}\uparrow\) \(\approx 1.01x-1.1x\)
 \(\quad\circ\) Exception Katran (\(\color{green}\downarrow\) \(\approx 0.98x\))
eHDL [97]—AMD/Xilinx Alveo U50: eHDL
—AMD/Xilinx Alveo U50: hXDP
—AMD/Xilinx Alveo U50: P4-SDNet
—NVIDIA/Mellanox BlueField-2
—Throughput (SDNet)\(^{b}\): \(\color{red}\rightarrow\) 1x
—Throughput (hXDP): \(\color{green}\uparrow\) \(\approx 27.4x-164.4x\)
—Throughput (BlueField-2, 4 Cores)\(^{c}\): \(\color{green}\uparrow\) \(\approx 11.7x-23.4x\)
—Latency (Avg., hXDP): \(\color{red}\uparrow\) \(\approx 1.03x\)
 \(\quad\circ\) \(\color{green}\downarrow\) \(\approx 0.9x\) (Firewall) \(-\) \(\color{red}\uparrow\) \(\approx 1.2x\) (Router)
—Latency\(^{d}\) (BlueField-2): \(\color{green}\downarrow\) \(\approx 0.1x\)
Taurus [113]—2 Servers, each with an Intel Xeon Gold 6248 @2.5 GHz
 \(\quad\circ\) MoonGen: Traffic generator and traffic sink.
—Taurus Switch Prototype:
 \(\quad\circ\) Wedge 100 BF-32X (Tofino Switch)
 \(\quad\circ\) AMD/Xilinx Alveo U250 (CGRA implementation)
—Case study: Anomaly Detection
 \(\quad\circ\) Inference on the CGRA
 \(\quad\circ\) Baseline inference on the control plane
 \(\quad\circ\) 5 Gbps traffic
 \(\quad\circ\) Sampling rate between 100 Kbps and 100 Mbps
—Estimated area overhead of \(3.8\%\) for ASIC implementation
—Latency:
 \(\quad\circ\) Baseline: 34 ms (100 Kbps) - 512 ms (100 Mbps)
 \(\quad\circ\) Taurus (Avg.): 122 ns
—Detection rate:
 \(\quad\circ\) Baseline: 0% (100 Mbps) - \(\approx\) 2.6% (1 Mbps)
 \(\quad\circ\) Taurus: \(\color{green}\uparrow\) 58.2% (100 Mbps - 100 Kbps)
—F1 score:
 \(\quad\circ\) Baseline: 0% (100 Mbps) - \(\approx\) 4.9% (1 Mbps)
 \(\quad\circ\) Taurus: \(\color{green}\uparrow\) 71.1% (100 Mbps - 100 Kbps)
—Online Training convergence fastest with higher sampling rate (\(10^{-2}\)), higher number of epochs (10) and small batch sizes (64).
Homunculus [114]—2 Servers: Intel Xeon Gold 6248 @2.5 GHz
 \(\quad\circ\) MoonGen: Traffic generator and traffic sink.
—Taurus Switch Prototype:
 \(\quad\circ\) Wedge 100BF-32X (Tofino Switch)
 \(\quad\circ\) AMD/Xilinx Alveo U250 (CGRA implementation)
—Case studies: Anomaly Detection, Traffic Classification, and Botnet Chatter Detection.
 \(\quad\circ\) Ideal F1 score calculated offline in SW.
—Achieved line rate for all three applications.
—Achieved ideal F1 score for all three applications.
NetReduce [85]—Single-GPU: six machines each
 \(\quad\circ\) 2x Intel Xeon E5-2064 10x @2.4 GHz
 \(\quad\circ\) 3x 32GB DDR4
 \(\quad\circ\) NVIDIA Geforce RTX 2080 8 GB
 \(\quad\circ\) NIC: NVIDIA/Mellanox ConnectX-5 (100 GbE)
—Multi-GPU: four machines each
 \(\quad\circ\) 2x Intel Xeon Gold 6154 18x @3.00 GHz
 \(\quad\circ\) 16x 64GB DDR4
 \(\quad\circ\) 8x NVIDIA Tesla V100 SXM2 32 GB
 \(\quad\circ\) NIC: NVIDIA/Mellanox ConnectX-5 (100 GbE)
—Evaluation:
 \(\quad\circ\) Image Classification (ImageNet dataset): AlexNet, VGG16, and ResNet50.
 \(\quad\circ\) NLP: BERT, GPT-2, MNLI, QNLI, QQP, and SQuAD
 \(\quad\circ\) Comparison with RAR, SwitchML, FAR, and TAR.
—Throughput (Single-GPU):
 \(\quad\circ\) ImageNet, RAR: \(\color{green}\uparrow\) \(\approx 1.05x\ -\approx 1.45x\)
 \(\quad\circ\) ImageNet, SwitchML: \(\color{green}\uparrow\) \(\approx 1x\ -\approx 1.25x\)
 \(\quad\circ\) NLP, RAR: \(\color{green}\uparrow\) \(\approx 1.22x\ -\approx 1.43x\)
—Throughput (Multi-GPU):
 \(\quad\circ\) ImageNet, FAR: \(\color{green}\uparrow\) \(\approx 1.15x\ -\approx 1.69x\)
 \(\quad\circ\) ImageNet, TAR: \(\color{green}\uparrow\) \(\approx 1.12x\ -\approx 1.58x\)
—Communication Improvement (Multi-GPU\(^{e}\)):
 \(\quad\circ\) ImageNet, RAR: \(\color{green}\uparrow\) \(\approx 1.16x\ -\approx 1.34x\)
—Accuracy Loss (Single-GPU):
 \(\quad\circ\) ImageNet, RAR: \(\color{red}\downarrow\) \(\approx 0.2\%\ -\approx 1.5\%\)
Pigasus [132]—Intel i7-4790 @3.60 GHz: Snort 3 SW only
—Intel i9-9960X @3.1 GHz: Snort 3 SW
 \(\quad\circ\) Intel Stratix 10 MX: Pigasus
—Number of Snort cores for 100 Gbps:
 \(\quad\circ\) IDS: \(\color{green}\downarrow\) 23–185x less than SW only
 \(\quad\circ\) IPS: \(\color{green}\downarrow\) 23–200x less than SW only
—Latency (SW only): \(\color{green}\downarrow\) \(1/3x\) \(-\) \(1/10x\)
—Estimated Power consumption: 49–166 W
 \(\quad\circ\) Compared to SW only: \(\color{green}\downarrow\) \(1/13x\) \(-\) \(1/59x\)
Table 3. Overview of the Setups and Achieved Results of the Proposals Discussed
a: The first number is the relative increase in utilization compared to hXDP. Total resource utilization in parentheses.
b: No implementation for DNAT with SDNet \(\rightarrow\) Problem with dynamic port selection implementation in P4.
c: Only one and four cores reported.
d: Latency for SDNet not reported.
e: Use of only one NVIDIA Tesla V100 per machine in this scenario.
KVS, Key Value Store; LaKe, Layered Key Value Store; LUT, Lookup Table.

9.1 SmartNIC and Switch Designs for Multi-Domain Applications on FPGAs

Lin et al. [81] present PANIC, a multi-tenancy SmartNIC design implemented on an ADM-PCIE-9V3 board equipped with an AMD/Xilinx Virtex UltraScale+ (XCVU3P-2) FPGA.
The incoming packet is received by a single-stage RMT module that is configured at design time. This means that, unlike the RMT stages of the PISA, it cannot be changed online. The module matches the IP address and port fields and maps them to a descriptor for further processing. The packet is then stored in the PB and the created descriptor is written to the scheduler’s Push-In First-Out (PIFO)-based priority queue [106].
The Central Scheduler supports both pull-based scheduling and push-based scheduling with chaining. It also uses a credit-based mechanism to keep track of the current workload of each CU. Depending on the workload, the scheduler operates in pull mode (high workload) or push mode (low workload). Push and pull mode here follow the same concept established in producer-consumer designs, such as those used in cloud messaging services. In pull mode, CUs actively request packets and must wait until the request is complete before they can start working. In push mode, the scheduler actively pushes packets to the CUs. Using chaining, a CU can push a packet directly to the next CU in the chain after processing if that CU's buffer is not full; otherwise, the packet is written back to the PB.
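The scheduler's mode decision can be summarized in a minimal C sketch (our illustration with hypothetical types and a stubbed push_to_cu primitive; PANIC itself implements this logic in Verilog):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor and CU state; PANIC's actual RTL differs. */
typedef struct {
    uint32_t cu_id;     /* next CU in the service chain              */
    uint32_t pkt_addr;  /* packet location in the packet buffer (PB) */
} descriptor_t;

typedef struct {
    uint32_t credits;   /* free slots in the CU's input buffer */
} cu_state_t;

/* Stand-in for the crossbar transfer into the CU's input buffer. */
static void push_to_cu(uint32_t cu_id, const descriptor_t *d) {
    (void)cu_id;
    (void)d;
}

/* Returns true if the descriptor was pushed (low load); false leaves
 * it queued in the PIFO until the highly loaded CU pulls it itself. */
static bool schedule_descriptor(cu_state_t *cu, const descriptor_t *d) {
    if (cu->credits > 0) {        /* credits left -> push mode */
        cu->credits--;            /* consume one credit        */
        push_to_cu(d->cu_id, d);
        return true;
    }
    return false;                 /* saturated -> pull mode    */
}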
A CU can be a hardware accelerator or a processor core. The authors used two hardware accelerators (AES and SHA3) and a RISC-V softcore for their prototype. Both hardware accelerators are based on IPs provided by OpenCores [13, 14]. For the softcore, the open-source RISC-V core generator VexRiscv [26] was used. All components of PANIC are connected via a crossbar built with the open-source network-on-chip router RTL [40].
All components are implemented in Verilog and can run at a frequency of 250 MHz. The only exception is the PIFO, which runs at 125 MHz; according to the authors, this is due to the PIFO's poor scalability on FPGAs. For the PIFO, they could not use the BRAM of the FPGA and had to rely on Lookup Tables (LUTs) only, which strongly impacts the available area utilization. With a data width of 512 bits, the crossbar can support a throughput of up to 128 Gbps per port (512 bit \(\times\) 250 MHz), exceeding the 100 Gbps of the MAC. To transfer packets to the host memory, PANIC uses the DMA engine provided by Corundum [58]. They also use the Corundum NIC driver on the host side.
SuperNIC, proposed by Lin et al. [82], shares similarities with PANIC [81]. Both investigate task offloading for multi-tenants to CUs on a SmartNIC. In both designs, CUs are interconnected with each other and a credit-based central scheduler over a crossbar, and they support task chaining between CUs. However, there are several key differences.3
First, PANIC focused on prototyping a SmartNIC on an FPGA with a general, fixed architecture, potentially implementable as an ASIC. In contrast, SuperNIC consists of three components: (a) FPGA logic for user-offloaded network computation; (b) ASIC logic for fixed system tasks (packet transmission, scheduling, and reception); and (c) GP cores for executing control plane tasks in software. While the latter two components are common in many SmartNICs, Lin et al. [82] primarily focused on offloading network computation to the FPGA logic, in the form of NTs described by a Directed Acyclic Graph (DAG), taking the reconfigurable nature of FPGAs into consideration.
The second major difference relates to a key concept in SuperNIC's design: the abstraction of NTs into virtual NT chains. While PANIC provides a single CU for each task, leading to limited scalability as the number of tasks grows, SuperNIC maps a chain consisting of multiple connected tasks (physical NT chain) to a single reconfigurable CU, called an NT region. A virtual chain is a subset of tasks from a DAG. However, while a virtual chain can contain tasks available in a CU, other tasks in that CU may not be part of the chain. To avoid frequent reconfiguration or overly fine granularity (which would lead back to PANIC's scalability issues), the authors propose the concept of bypassing or "skipping" tasks in a chain. To support this mechanism, each task in a CU is augmented with a wrapper.
For better performance and utilization, tenants can also share a CU, i.e., if a task in a CU is not needed by one tenant, it can be used by another tenant at the same time. In addition, SuperNIC supports execution of multiple virtual chains belonging to a single DAG (DAG Parallelism) in parallel to reduce execution time and replication of CUs (Instance Parallelism) to increase throughput.
To ensure fairness, they provide the concepts of space sharing and time sharing. For space sharing, they determine the number of instances of an NT chain to start and the amount of onboard memory to allocate to each user, using the DRF approach [62]. For time sharing, the allocated resources from the space sharing step are considered fixed, and a virtual start and end time is assigned to each packet, allowing time sharing of CUs, ingress and egress bandwidth, and buffers.
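The space-sharing step can be illustrated with a minimal C sketch of the DRF idea [62], repeatedly granting one more NT chain instance to the tenant with the lowest dominant share; resource types, capacities, and demands below are illustrative and do not reflect SuperNIC's actual allocator:

#include <stdio.h>

#define NRES 2  /* 0: NT chain instances (area), 1: on-board memory (MB) */
#define NTEN 3  /* tenants */

int main(void) {
    double cap[NRES]  = {16.0, 8192.0};   /* total capacity */
    double used[NRES] = {0.0, 0.0};
    double demand[NTEN][NRES] = {{1, 512}, {2, 256}, {1, 1024}}; /* per grant */
    double alloc[NTEN][NRES]  = {{0.0}};

    for (;;) {
        int best = -1;
        double best_share = 2.0;          /* shares are always <= 1 */
        for (int t = 0; t < NTEN; t++) {
            int fits = 1;                 /* does one more grant fit? */
            for (int r = 0; r < NRES; r++)
                if (used[r] + demand[t][r] > cap[r]) fits = 0;
            if (!fits) continue;
            double dom = 0.0;             /* tenant's dominant share */
            for (int r = 0; r < NRES; r++)
                if (alloc[t][r] / cap[r] > dom) dom = alloc[t][r] / cap[r];
            if (dom < best_share) { best_share = dom; best = t; }
        }
        if (best < 0) break;              /* no demand fits anymore */
        for (int r = 0; r < NRES; r++) {
            alloc[best][r] += demand[best][r];
            used[r] += demand[best][r];
        }
    }
    for (int t = 0; t < NTEN; t++)
        printf("tenant %d: %.0f instances, %.0f MB\n", t, alloc[t][0], alloc[t][1]);
    return 0;
}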
Unlike NICs, switches are connected to a much larger number of network nodes, making them ideal targets for in-network application acceleration. However, adding compute logic can impact not only individual nodes but also overall network performance for traffic routed through the switch. Additionally, switches typically manage a larger number of tenants compared to NICs.
Stoyanov and Zilberman proposed the Multi-Tenant Portable Switch Architecture (MTPSA) [109], an extension of the P4 PSA with OS-like roles and privileges for security isolation. The pipelines generally follow the PSA definition. However, the ingress and egress pipelines belong to the superuser (typically the network operator), and the egress pipeline is extended by an encapsulated user (sub-)pipeline between the parser and MAT stages of the superuser egress pipeline. The superuser header vector and metadata bypass the user pipeline. Encapsulated user packets (e.g., using VXLAN) are decapsulated by the superuser pipeline and processed by the user pipeline. This keeps user programs opaque: they run without interference from, or visibility to, the superuser pipeline and other user (P4) programs. The authors extended P4C with a back-end compiler for the BMv2 SW-target to generate superuser and user pipelines separately. Due to SDNet limitations, they adapted the design for NetFPGA, placing the user pipeline before the superuser egress pipeline.
Saquetti et al. [102] proposed an SoC consisting of ASIC and FPGA logic components based on their prior work P4VBox [101] for data plane virtualization. The core of their proposed architecture is an array of reconfigurable vSwitches realized in the FPGA logic. The input and output queues, buffers, interfaces, and vSwitch queues are implemented as ASIC. Additionally, the SoC provides a management interface to an external off-chip controller. In their evaluation, they achieved up to \(3.2\) Tbps for the ASIC implementation and \(129.61-132.63\) Gbps for a single vSwitch instance. However, they were constrained by the limited on-chip BRAM of the XCVU13P FPGA. This limitation allowed them to saturate the maximum bandwidth of \(3.2\) Tbps only for the vSwitch implementation of the L2-Switch. For the other evaluated use cases (Router, Firewall, INT), they achieved only approximately \(1.43-2.23\) Tbps (about 45–70% of the maximum bandwidth).
Discussion. In most real-world scenarios, serving multiple tenants is necessary. Given that resources on SmartNICs and switches are much more limited than on servers, it is crucial to investigate how to implement accelerators for specific tasks or domains, and to ensure a high degree of fine granularity and reusability in their design.
However, incorporating dedicated accelerators in a design increases the scheduling complexity when mapping tasks efficiently in a dynamic, multi-tenant environment. Therefore, it is necessary to investigate intelligent scheduling and placement methods, as demonstrated in [81] and [82]. The approach in [82] offers some advantages over [81], as it can map multiple tasks into a CU, potentially allowing better utilization of the resources of a Reconfigurable Partition (RP) and requiring less interaction with the scheduler. However, the authors of [82] currently only support manual mapping of physical chains to a CU, leaving automatic mapping as a topic for future research.
Works like PsPIN [53] and Flare [52] (based on [53]) have investigated the implementation of RISC-V-based MaSoCs for SmartNICs and switches, respectively. These designs provide advantages such as flexibility, high performance through a high degree of parallelism, and easy scheduling due to the homogeneous design of the PEs (in contrast to [81, 82]). However, the scalability of such designs may be limited as the number of clusters and PEs per cluster increases, due to interconnection complexity and limited access to shared memory. To address these limitations, such designs could be complemented by fine-granular hardware accelerators that can be shared within and/or between clusters. This approach, however, would potentially increase scheduling complexity again.
While both [102] and [82] discuss Partial Reconfiguration (PR), [102] also includes a simple model regarding the effect of PR during the initial (re-)configuration of the CUs. PR provides the flexibility to change specific regions/CUs, but this typically comes at the cost of increased resource utilization, limiting the design space for (global) optimizations and potentially leaving partitions underutilized. Furthermore, it can lead to higher latencies when elements interact across partitions [102]. To address this, techniques like chaining offloaded tasks into a single RP, as investigated in [82], could provide an approach for better utilization of the RP. However, only a limited number of works, such as [82, 90, 102], have considered or investigated PR for SDN and INC. Furthermore, technologies like Nested Dynamic Function eXchange [32] still remain to be investigated in this context.

9.2 Investigation of SmartNIC Designs for eBPF/XDP Offloading to FPGAs

Brunella et al. [46] presented hXDP, a solution for accelerating eBPF/XDP programs on a NetFPGA. They implemented an IP consisting of a Programmable Input Queue and an Output Queue connected to a finite state machine called the Active Packet Selector (APS), which contains separate packet read and write buffers that allow parallel reads and writes. The APS is connected via a data bus to an implementation of the eBPF Instruction Set Architecture (ISA) [4] on a custom 4-stage, 4x superscalar soft-core Very Long Instruction Word (VLIW) processor on the FPGA. To take advantage of the data parallelism inherent in some functions, the VLIW processor is connected via an additional bus to the helper function module used to implement the accelerators. The interface of the helper function module follows the eBPF definition. The accelerators can read data directly from the registers containing the arguments of the function call (R1–R5) and write the results back to the return value register (R0). To support the functionality of eBPF maps4 in hardware, a configurator for the shared maps memory region on the FPGA was implemented, which creates the maps at program load time. To support direct access to individual map entries, the maps memory, like the APS, is connected directly to the VLIW processor via the data bus.
In order to run unmodified eBPF code, they provide a compiler that translates the compiled eBPF bytecode to the design's hXDP ISA. The custom compiler for their hardware target optimizes the eBPF program by removing instructions unnecessary on the FPGA target, such as boundary checks and the zeroing of memory areas for variable initialization. They also changed the ISA from a two-operand machine to a three-operand machine and added support for 6-byte alignment5 for load/store instructions to fit the field size of the MAC address in a packet. Since the implementation of the VLIW processor on the FPGA is much simpler from a design perspective than that of a modern x86 CPU, which supports, for example, out-of-order execution with branch prediction and speculative execution, they instead use static compiler analysis for instruction-level parallelism and schedule branches to different lanes for parallel execution.
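To make the offload target concrete, the following is a minimal XDP program in standard libbpf C of the kind hXDP executes (our example, not code from [46]):

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Drop all IPv4 packets, pass everything else. The bounds check
 * against data_end is mandated by the kernel verifier -- one of the
 * instruction classes hXDP's compiler elides for the FPGA target,
 * where memory accesses are safe by construction. */
SEC("xdp")
int xdp_drop_ipv4(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)  /* verifier-mandated check */
        return XDP_PASS;
    if (eth->h_proto == bpf_htons(ETH_P_IP))
        return XDP_DROP;
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";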
Their work showed that it is reasonable to offload XDP programs to a NIC. Not only can this lead to better performance for networking applications while being more power efficient, but it can also free up CPU cores from specialized networking tasks that they were not designed for, so they can be used for other, more general tasks instead. The latest version of hXDP has been integrated into Corundum [58] and now targets a Xilinx Alveo U50 [42].
A follow-up work by Bonola et al. [42] extended the Corundum-integrated design of hXDP [46] with a fused parser-MAT pipeline, called the Warp Engine (WE). The idea is that simple parts of an eBPF/XDP program, such as header parsing, can be offloaded to simple MATs. In addition to the WE, they extend the hXDP compiler flow by adding their own custom compiler, called the Warp Optimizer, in front of this flow. The Warp Optimizer first reads the bytecode of the program and creates a Control Flow Graph (CFG) to analyze which parts can be offloaded. For offloading to the WE, the compiler, similar to a P4 compiler, creates match-action rules to configure the MAT. The WE also provides a feature called context restoration: if the WE cannot run the whole program, it executes only parts of it and creates a correct initial state for hXDP, which then processes the rest. The WE is also able to process the next packet while hXDP is still busy with the current packet. This ensures that the WE does not become the bottleneck in the system. Tasks such as simple packet forwarding can be handled without the hXDP components being involved at all.
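Conceptually, the WE dispatch with context restoration can be sketched as follows (hypothetical C types; the actual WE is a hardware pipeline, and run_hxdp stands in for handing control to the hXDP core):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MAT result: either a terminal verdict (fast path) or
 * a restored initial state for the hXDP core (context restoration). */
typedef struct {
    bool     terminal;   /* rule handles the packet completely */
    int      verdict;    /* e.g., drop/forward if terminal     */
    uint32_t resume_pc;  /* hXDP entry point otherwise         */
    uint64_t regs[11];   /* pre-computed eBPF register file    */
} we_result_t;

/* Stand-in for handing packet and restored context to the hXDP core. */
static int run_hxdp(uint32_t pc, const uint64_t regs[11],
                    void *pkt, size_t len) {
    (void)pc; (void)regs; (void)pkt; (void)len;
    return 0;
}

static int we_dispatch(const we_result_t *r, void *pkt, size_t len) {
    if (r->terminal)
        return r->verdict;  /* fast path: hXDP never involved */
    return run_hxdp(r->resume_pc, r->regs, pkt, len);  /* slow path */
}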
In general, the concept behind this idea is the same as fast-path processing and slow-path offloading, as known from PISA architectures. The difference is that while slow-path offloading usually relies on GPPs, the hXDP architecture can also provide accelerators for complex rules.
An alternative approach to [46] and [42] is eHDL [97]. Both [46] and [42] provide a predefined architecture that is mapped/prototyped on an FPGA, and the eBPF bytecode is compiled for this fixed target architecture. eHDL [97], on the other hand, also takes the eBPF bytecode as input, but instead generates a hardware pipeline from it, which is integrated into the Corundum NIC. Like the hXDP compiler, eHDL merges the two-operand eBPF instructions into three-operand instructions, and as in [42, 46], a CFG and a Data Dependency Graph of the program are generated. eHDL provides templates for the hardware primitives and maps the instructions onto them. To support (conditional) jump instructions, the pipeline creates disable signals that deactivate the operations of the following stages, simply forwarding the packet through them until the jump offset (in pipeline stages) from the current stage is reached; hXDP, in contrast, schedules all dependent instructions on the same lane to avoid data hazards and uses per-lane data forwarding to avoid stalling the pipeline for these dependent instructions. For possible Write-After-Read data hazards, the design generates delay registers that work on a FIFO principle. In the case of Read-After-Write (RAW) data hazards, they distinguish between two cases: per-flow and global states. For the per-flow case, they insert an additional block that stores the addresses of read operations and compares them with the address of a write operation. If at least one of the addresses matches, the pipeline is flushed. To avoid repeated write operations to earlier maps leading to a wrong system state when the RAW hazard is detected at a later stage, so-called elastic buffers are added between the pipeline stages. This allows only the pipeline stages starting from the hazard detection to be flushed, so that earlier stages do not have to be repeated. While this can theoretically improve throughput compared to a full pipeline flush, the authors' evaluation showed that the actual probability of flushes caused by per-flow states, and the resulting degradation of throughput, is low. For frequently used global states such as packet counters, eBPF provides an atomic operation to avoid concurrent access to the same map. To realize the same behavior in hardware, eHDL provides a block that performs in-place Key-Value (KV)-based lookups.
Discussion. The advantage of the approach presented in [97] is that it utilizes only the hardware components required for the given instructions. This avoids the overhead of instruction fetch and decode phases present in [42, 46], as well as the need for hardware to execute/accelerate instructions that might not be used in some cases. Both the eHDL and SDNet implementations achieved line-rate performance, unlike hXDP and BlueField-2. For the tested use cases, eHDL delivered up to about \(164.4\) times higher throughput compared to hXDP and up to about \(23.4\) times higher throughput compared to BlueField-2. This demonstrates that dedicated hardware pipelines for packet processing can achieve significantly higher and more consistent performance than GP/MuSoC-based architectures, while being more power efficient.6 Additionally, eHDL requires less than \(20\%\) of the resources available on the FPGA (Figure 5), leaving room for scalability or for providing multiple services simultaneously. However, a disadvantage exists: Adapting the architecture to a new program requires synthesis and FPGA reconfiguration. The authors of [97] acknowledge this and consider using DPR for future work. This would allow adding, replacing, or removing hardware pipelines at runtime, depending on the specific services needed and/or the required throughput. While DPR might solve the downtime issue during reconfiguration, the synthesis process itself remains a significant overhead and is not suitable for fast redeployment. A potential solution to this problem, as discussed in Section 8.5, could be the use of overlays.
Fig. 5.
Fig. 5. Comparison of the reported resource usage of hXDP [46], hXDP + WE [42], and eHDL [97] with SDNet. Since all three proposals [42, 46, 97] use Corundum [58] on an Alveo U50 as the basis for the NIC design, it is listed here as well to show the overhead introduced by the proposed designs. Rivitti et al. [97] did not report on L2 ACL and Katran, so they are omitted here. Results for Corundum are based on synthesis results obtained using Vivado 2023.1.

9.3 Proposals for Different Application Domains Realized on FPGAs

In the following, we selected works conducting research in the domains of Domain Name Server (DNS), CPs, caching, ML, and security. For each domain, we first introduce the domain, present challenges, and discuss proposed works that address (some of) these challenges.
DNS. The DNS service is one of the most essential services on the Internet. It defines the mapping of human-readable domain names to machine-readable IP addresses. However, as demand grows for network services such as DNS to handle an increasing number of queries while providing low latency, software-only solutions will no longer be sufficient.
Kiwi [104] is an HLS system that generates RTL (Verilog) output from C# code, which can be used to program FPGAs. Emu [111] is a standard library that supports the implementation of network functions with Kiwi. The Emu library includes a simple DNS server (Emu DNS in Table 3) that can handle non-recursive queries and query resolution with names of at most 26 bytes for IPv4 addresses. However, according to the authors, these restrictions can be relaxed to handle longer names and IPv6 addresses.
Another approach to implementing a DNS service on FPGAs is P4DNS [124]. Unlike Emu, P4DNS is implemented in P4 using NetFPGA. P4DNS provides a simple low-level name server that supports recursive queries. The implementation uses features from both the data and control planes to achieve high performance. The data plane implementation is based on the RMT pipeline, with some modifications, and is responsible for the performance-critical parts of the design.
Discussion. When investigating basic network services such as DNS with modern network processing models/languages such as PISA/P4, problems became apparent. First, the generality of the parser FSM produced by P4C leads to increased compile time and resource usage, because even hard-coded headers (Ethernet, IPv4, UDP, and DNS) that could be compressed into a single state are produced as separate states. In addition, P4C does not support variable-length fields; since DNS headers can have different lengths, the parser implementation supports multiple lengths, each with its own state [124]. Although P4 provides a vendor-independent model, the restrictions and limitations create strong hurdles that may make implementations for INC inadequate or even impossible. Therefore, further research is needed in the direction of providing/improving high-level abstraction models and (automatic) tool flows capable of efficiently translating and mapping such a model to reconfigurable hardware. To realize such tool flows, the investigation of improved HLS compilers for C# [104], C, or C++ (e.g., from AMD/Xilinx or Intel), automatic integration on FPGA-equipped SmartNICs and switches, and libraries such as Emu [111] will be important to ease the hurdles for implementing reconfigurable hardware accelerators targeting INC.
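The variable-length problem is easy to see in code: parsing a DNS name requires a loop whose trip count depends on the packet contents, which a P4 parser must unroll into one state per supported length. A minimal C sketch (ours, not from [111] or [124]):

#include <stddef.h>
#include <stdint.h>

/* A QNAME is a sequence of length-prefixed labels terminated by a
 * zero byte (or a 2-byte compression pointer), so its total size is
 * only known after walking the labels. Returns the offset just past
 * the name, or -1 on truncated input. */
ptrdiff_t skip_qname(const uint8_t *msg, size_t len, size_t off) {
    while (off < len) {
        uint8_t l = msg[off];
        if (l == 0)                  /* root label: end of name  */
            return (ptrdiff_t)(off + 1);
        if ((l & 0xC0) == 0xC0)      /* compression pointer      */
            return (off + 2 <= len) ? (ptrdiff_t)(off + 2) : -1;
        off += 1 + l;                /* skip length byte + label */
    }
    return -1;
}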
CP. A fundamental problem in distributed computing is how to ensure the reliability of distributed systems while processes may be faulty. The goal of CPs is therefore to find consensus among processes about states or data needed for computation. The process failures in a distributed system can be divided into two classes: crash failures and Byzantine failures. Crash failures describe scenarios in which a process abnormally terminates its execution and cannot be resumed. Byzantine failures, on the other hand, describe arbitrary process failures. These include cases such as crashes, incorrect states, failure to send or receive messages, sending messages with incorrect or even malicious content, and so on. Byzantine failures can disrupt other processes or cause failures, whether unintentional or caused by a malicious attacker. However, they cannot control the network: any process can uniquely identify the sender of a message [61]. Crash failures are handled by Crash Fault Tolerant CPs, and Byzantine failures by Byzantine Fault Tolerant (BFT) CPs. However, to meet the demands for scalability, high bandwidth, and low latency, BFT CPs implemented in software would become a bottleneck.
Sit et al. [105] implemented the Practical Byzantine Fault Tolerance (PBFT) algorithm [47] to study different configurations of it. In their work, they show that it is unlikely that networks with more than 10 Gbps can be saturated without the use of hardware accelerators. Therefore, they investigated offloading the processing to SmartNICs and standalone FPGAs.
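For reference, the quorum predicates at the heart of PBFT [47] can be sketched as follows (a minimal illustration of the protocol logic, not of the accelerated implementation in [105]):

#include <stdbool.h>
#include <stdint.h>

#define F 1                /* tolerated Byzantine faults     */
#define N (3 * F + 1)      /* replicas required by PBFT [47] */

typedef struct {
    bool     pre_prepare;  /* matching PRE-PREPARE received            */
    uint32_t prepares;     /* matching PREPAREs from distinct replicas */
    uint32_t commits;      /* matching COMMITs (may include own)       */
} msg_state_t;

/* A request is prepared after the PRE-PREPARE plus 2f matching
 * PREPAREs, and committed locally after 2f + 1 matching COMMITs. */
bool prepared(const msg_state_t *s) {
    return s->pre_prepare && s->prepares >= 2 * F;
}

bool committed_local(const msg_state_t *s) {
    return prepared(s) && s->commits >= 2 * F + 1;
}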
Discussion. One major challenge they identified for offloading PBFT algorithms is that the iterative computation of cryptographic algorithms such as RSA is difficult to realize on FPGAs because of the low achievable frequency, and it would account for a large portion of the logic utilization of a PBFT implementation. Nevertheless, their work shows that with SmartNICs, fine-grained batching with strict latency guarantees is a reasonable option for acceleration, and that mid-range FPGAs such as the Xilinx Virtex-7 690T are already a good target platform for PBFT implementations. In general, FPGAs perform best for highly parallel algorithms, but are not the best choice for iterative algorithms, where higher frequencies are beneficial. However, we are seeing a trend where major FPGA manufacturers offer a PS tightly coupled to the FPGA's PL on a single SoC, which can also be used to implement the iterative parts of an algorithm, providing the benefits of both worlds.
Intrusion Detection and Prevention Systems (IDS/IPS). IDS/IPS are critical components of the network infrastructure to protect systems from attacks. Both systems scan network packets and compare their contents to a database of known threats. However, an IDS is only responsible for detection and monitoring and does not intervene on its own, while an IPS accepts or rejects packets based on a set of rules. Network operators are faced with the challenge of securing networks with hundreds of thousands of concurrent connections and tens of thousands of rules to be checked by the IDS/IPS. To ensure future scalability, software-only solutions are insufficient, and PISA designs such as the Tofino switch cannot be used to implement a full IDS/IPS.
Therefore, Zhao et al. proposed an FPGA-based solution for SmartNICs, called Pigasus [132]. In general, the concept of IDS/IPS is based on solving pattern matching problems (header matches, string matches, and regular expressions [132]). The proposed design consists of an FPGA connected via Ethernet and CPUs connected via PCIe. The parser, reassembler, and Multi-String Pattern Matcher (MSPM) (and its extensions, Non-Fast/Fast Pattern String Matching) are implemented on the FPGA. The Regular Expression and Full Match stages are implemented on the CPUs; however, they only interact with about 5% of the packets. A core component of the design is the reassembler, which takes the forwarded packets from the parser, sorts the TCP packets, and records the last bytes of the previous packets to allow contiguous and cross-packet searches by the MSPM. To fit the Snort registered ruleset [25, 98] into the available BRAM of the Intel Stratix 10 MX FPGA, the proposed design uses Hyperscan-inspired hash table lookups [123] for the MSPM instead of state machines for exact matching. For the full matcher on the CPU side, an adapted version of Snort 3 is used to receive Protocol Data Units and rule IDs. With their FPGA-based SmartNIC design combined with a CPU system, they were able to develop a single-server solution for IDS/IPS.
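The shift from exact-matching state machines to hash-based lookups can be illustrated with a simplified prefilter sketch in C (window size, table layout, and hash function are ours, not Pigasus's):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WIN   8            /* bytes hashed per window */
#define SLOTS 4096         /* power of two            */

typedef struct { uint8_t prefix[WIN]; int rule_id; } slot_t;
static slot_t table[SLOTS];        /* filled offline from the ruleset */

static uint32_t h(const uint8_t *p) {   /* toy FNV-style window hash */
    uint32_t x = 2166136261u;
    for (int i = 0; i < WIN; i++) x = (x ^ p[i]) * 16777619u;
    return x & (SLOTS - 1);
}

/* Probe the table at every byte offset; only hits (rule_id != 0)
 * proceed to the expensive full/regex match stages. */
int prefilter(const uint8_t *payload, size_t n, int *rule_out) {
    for (size_t i = 0; i + WIN <= n; i++) {
        slot_t *s = &table[h(payload + i)];
        if (s->rule_id && memcmp(s->prefix, payload + i, WIN) == 0) {
            *rule_out = s->rule_id;    /* candidate match         */
            return 1;
        }
    }
    return 0;                          /* fast path: no suspicion */
}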
Discussion. The authors showed that it is generally possible to implement a complete IDS/IPS on a single FPGA. Only the full match stage was still running on a CPU, because it is not the bottleneck for most packets and an implementation would have required too much BRAM (\(\sim\)24 MB) for only about 5 Gbps of traffic. Also related to the limited amount of on-chip memory, the authors faced a design tradeoff between the number of rules, concurrent flows, and the number of pipeline replications. While they could achieve 200 Gbps throughput with two pipelines, they were quickly limited by the number of rules they could store in the BRAM. However, since newer FPGAs provide more on-chip memory (up to several hundred Mb), for example, in the form of Ultra-RAM (URAM), or up to several tens of GB of HBM in a single package, implementing a complete IDS/IPS on a single FPGA may be possible with modern FPGAs.
Caching/KVS. Local storage/caching of data as close to a PE as possible is a proven technique for achieving low latency, not only for GPPs, but the concept is also used in large systems such as data centers. The reason for this is that it is not practical for data centers or other distributed systems to store all data locally at each node, which increases storage costs or requires additional synchronization to update all data over the network. Cloud server systems also use caching techniques to store frequently requested data on cloudlet servers, while requesting data from cloud servers only when it is missing from the cache to exploit data locality [37]. To meet the increasing demands for low latency, even with a large number of end-user devices, data must be stored even closer to or directly at the edge.
LaKe. LaKe [117] is a hardware solution implemented in Verilog based on the Memcached design [9, 57]. LaKe is implemented on a NetFPGA-SUME [136]. To hide access latency, the design provides two cache layers. The first layer is a shared cache consisting of fast but small on-chip BRAM, and the second layer consists of slower but larger on-board DRAM. The DRAM is used to store hash table buckets and data chunks, while the BRAM is used as a shared cache between PEs to store frequent KV pairs. In addition, a slab allocator, as also used in Memcached, is implemented in hardware, using the available on-board SRAM to store the addresses of unused chunks and a FIFO to preload the next available address to hide access latency.
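Conceptually, the lookup path resembles a two-level cache; the following C sketch illustrates the idea (structures and sizes are illustrative, and dram_lookup stands in for the on-board DRAM hash table):

#include <stdbool.h>
#include <stdint.h>

#define L1_SLOTS 256

typedef struct { uint64_t key; uint64_t val; bool used; } entry_t;
static entry_t l1[L1_SLOTS];       /* stands in for the on-chip BRAM cache */

/* Stand-in for the hash-table lookup in on-board DRAM. */
static bool dram_lookup(uint64_t key, uint64_t *val) {
    (void)key; (void)val;
    return false;
}

bool kv_get(uint64_t key, uint64_t *val) {
    entry_t *e = &l1[key % L1_SLOTS];    /* direct-mapped probe  */
    if (e->used && e->key == key) {      /* L1 hit: BRAM latency */
        *val = e->val;
        return true;
    }
    if (!dram_lookup(key, val))          /* miss in DRAM as well */
        return false;
    e->key = key; e->val = *val; e->used = true;  /* promote to L1 */
    return true;
}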
Discussion. Their research showed a significant improvement in throughput, power efficiency, and especially latency compared to a host-based KV system. While the authors evaluated their design on a SmartNIC, they noted that such a design could also be implemented on switches. With the growing number of participants, such as IoT devices, the need to keep data close to the end devices within the network becomes apparent. This not only minimizes latency and offloads resources from the host, but also relieves transmission pressure within the network when deployed on switches, further reducing latency. In addition, conformance and seamless integration with existing technologies such as Memcached is of great importance to keep the hurdles for developers as low as possible, making it feasible to integrate INC designs into new and existing infrastructure.
ML. To efficiently train ML models, a large amount of data is required. Scaling up, i.e., upgrading a device/server with more powerful hardware, is limited and economically unfeasible beyond a certain point. Therefore, a common approach to scaling to large models and data volumes is to exploit data parallelism, not only on a single device, but by partitioning and distributing the data across multiple worker nodes. Training on data in parallel over the network and synchronizing updates is supported by popular ML frameworks such as PyTorch [22] (Distributed Data-Parallel training) and TensorFlow [28] (tf.distribute.Strategy). Supervised ML models use iterative algorithms such as Stochastic Gradient Descent, where the sum of model updates must be computed frequently. This leads to periods of high network load for transmitting the updates, which results in the need to pause training until the updates are complete.
Liu et al. [85] proposed an in-network aggregation approach based on Remote Direct Memory Access (RDMA) [86], called NetReduce. They connected a Virtex UltraScale FPGA [30] to a standard (nonprogrammable) switch. In their proposed design, updates sent by worker nodes are aggregated in the network, and only the resulting aggregated gradient values are forwarded. They compare their proposed design to SwitchML [100], which is similar to NetReduce but has been implemented for programmable ASIC switches (Tofino). To interact with the proposed switch design from the host side, they reuse the RDMA over Converged Ethernet protocol for their NetReduce protocol and the NVIDIA Collective Communication Library (NCCL) for optimized communication with the NVIDIA GPUs in their testbeds. They exposed the NetReduce protocol as a primitive in NCCL and used it in PyTorch and in the Horovod framework to support TensorFlow.
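At its core, the aggregation that such designs move into the network is an element-wise sum over the workers' update vectors. A minimal C sketch with illustrative sizes (real designs such as SwitchML operate on quantized integer segments, since switch pipelines lack floating-point support):

#define WORKERS 4
#define DIM     8

/* Sum the update vectors of all workers element-wise so that only
 * one aggregate leaves the switch instead of one vector per worker. */
void aggregate(const float grad[WORKERS][DIM], float out[DIM]) {
    for (int d = 0; d < DIM; d++) {
        float sum = 0.0f;
        for (int w = 0; w < WORKERS; w++)
            sum += grad[w][d];     /* element-wise reduction        */
        out[d] = sum;              /* one result instead of WORKERS */
    }
}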
Taurus [113] presents a modified RMT pipeline for the data plane by introducing a MapReduce control block based on the Plasticine CGRA [95]. Their design targets line-rate inference for ML models such as Support Vector Machines, Deep Neural Networks, k-means, and Long Short-Term Memory. However, in the absence of a CGRA-based programmable switch, the authors built a prototype using a Tofino switch connected over a 100 Gbps Ethernet link to a Xilinx Alveo U250 to emulate the MapReduce block and evaluate their design. In their follow-up work, Homunculus [114], they present a framework that supports both Taurus switches and programmable ASIC switches. Homunculus provides functions in Python that can be used, for example, with TensorFlow. The Python front-end generates JSON configuration files that are optimized by HyperMapper [92]. The back-end takes the optimized configuration to generate P4 or Spatial [78] code, depending on the target switch. To generate the P4 code targeting Tofino switches, they use the IIsy framework [133].
Discussion. According to [85] and [113], the standard P4-programmable PISA architecture is not sufficient to accelerate ML in the data plane. In addition to the lack of features to implement ML models in PISA architectures [113], storing the results of the ML model [113] or the RDMA link [85] as flow rules would lead to frequent interaction with the control plane in modern dynamic network environments [85, 113], which should be avoided as it leads to additional delays in message processing. For example, in the context of in-network inference, the authors of Taurus [113] found that not the ML itself but rather table rule installation and packet collection become the bottleneck when telemetry packets are sent to the control plane. In their study, they showed that this high delay causes the design to miss most anomalies, resulting in a low F1 score. When inference is performed on the data plane using the CGRA instead, they achieve a high F1 score regardless of the sampling rate (Table 3). For online training, where the control plane optimizes the global metric and updates the weights of the Taurus ML model, they found that higher sampling rates and more epochs with small batch sizes lead to faster convergence of the F1 score (Table 3). Their work showed that using ML models to optimize network functions such as anomaly detection can be realized in the network, but also that current PISA-based switches lack the necessary features and CPU-based solutions do not provide the necessary performance to run ML models efficiently [113], which confirms the study conducted by Cooke and Fahmy [51].

9.4 Lessons Learned and Open Questions

We reviewed research proposing different hardware implementation solutions for different domains. We found that many challenges arise from the limitations of existing PISA-based architectures, but also that there is no one-size-fits-all solution. While some problems, such as variable field lengths, could be solved using the extern calls introduced with P4\({}_{16}\), P4 implementations would lose their generality/portability [124]. In addition, many PISA design solutions would require frequent interactions with the control plane to offload processing to the network [79, 85, 113] or lack necessary features [85]. Proposals such as PsPIN [53] can provide high performance despite a high degree of flexibility and parallelism, but there are limitations as soon as domain-specific acceleration is required. While [109] and [102] investigated virtualization for multi-tenant program offloading on reconfigurable switches, including security aspects [109] and a discussion of challenges regarding PR [102], their scope was limited to basic network applications like L2-Switches and INT. Works such as [81, 82] have included the investigation of hardware scheduling techniques for heterogeneous SmartNIC devices. However, several areas remain open for future work, including the automation of finding the right partition size, and task-mapping mechanisms that decide which task or task chain to offload to a specific RP and when to perform PR.
As we have seen in the study by Sit et al. [105], while it is possible to implement entire services on FPGAs, implementations of, for example, iterative algorithms may suffer from low achievable frequencies, and some parts, as in the work of Zhao et al. [132], are not easy to implement on hardware accelerators, so the performance gained by offloading them is not in reasonable proportion to the effort. Liu et al. [85] conclude that offloading a task such as in-network aggregation to a SmartNIC is generally only useful to free up the host CPU, since implementing a hardware accelerator requires a lot of development effort, and MuSoC-based SmartNICs such as NVIDIA's BlueField DPUs may suffer from similar latency and throughput issues as existing in-network aggregation solutions.
Designs such as hXDP [46] have proposed combining an eBPF-compliant VLIW processor with tightly integrated accelerators and the WE [42]. However, as the accelerators in hXDP are fixed at runtime, it lacks, like state-of-the-art MuSoCs and MaSoCs, the flexibility to exchange hardware accelerators online. eHDL [97], on the other hand, provides an automatic eBPF \(\rightarrow\) FPGA workflow that creates a whole pipeline instead. It improves throughput compared to hXDP and BlueField-2 [42, 46], with similar latency compared to hXDP (Table 3). However, eHDL does not provide online reconfigurability either; in fact, hXDP even has the advantage that it can be reprogrammed and the WE can be repopulated at runtime [46]. Therefore, realizing online reconfigurability to support offloading and exchanging eBPF programs (or programs in general) without reconfiguring the whole FPGA and interfering with other eBPF programs remains an open challenge for research.
The possible range of applications that can be covered with INC is wide. It includes domains like network services, security, and ML, which we have discussed in this work, but also extends to areas such as Computer Vision and robotics. To make the utilization of reconfigurable hardware accelerators attractive, not only the network community but also developers from other domains have to be convinced. Programming diverse network devices requires architecture-specific knowledge, even with tools like SDNet/VNP4 for FPGAs. While [42, 46, 97] investigated the offloading of eBPF/XDP to hardware accelerators, their work required implementing custom compilers for specific targets and hardware knowledge to add new features. DSLs and libraries that abstract target-specific functionalities and optimizations could help to lower this burden. Additionally, compiler infrastructures like MLIR [80] can provide reusability and extensibility for the rapid deployment of domain-specific compilers or feature extensions. Therefore, the investigation of design choices to support hardware acceleration for various applications without burdening application developers [135], including the seamless integration of reconfigurable hardware accelerators into existing network and application infrastructures and paradigms, is a potential topic that few have addressed. To determine the benefits and identify potential challenges of INC hardware acceleration, further research under different network traffic conditions and with consideration of custom protocols [108] is needed. Additional research challenges, such as programmability and adaptability, are discussed in the following section.

10 Additional Open Research Challenges

Although research in the field of SDN and INC has made considerable progress, we have seen in the literature that developers and researchers face significant challenges in programming the data plane.
Limited Compute Capabilities and Adaptability. Delivering high performance from each compute device in the network, while being cost- and energy-efficient, cannot be achieved by software alone (Section 4). Accelerators for cryptography, for example, are implemented as hardened IP in most modern architectures. However, for many applications where existing hardened accelerators cannot be used, these devices must rely on GP cores for processing. While proposals such as eHDL [97] provide flexibility and line-rate performance, they come with the disadvantage of high deployment (and development) time whenever the accelerator needs to be adapted. To address these issues, we believe that techniques such as JIT compilation for coarse- or multi-level hardware accelerators and the integration of embedded reconfigurable hardware, such as embedded FPGAs tightly coupled to GP cores, need to be further investigated for INC acceleration.
Computation vs. Latency and Throughput. Ports and Nelson [94] concluded that PISA-based programmable ASIC switches achieve similar latency to standard nonprogrammable switches due to their fast lookup capabilities, but they lack flexibility. FPGA-based switches, on the other hand, could provide the required flexibility. However, putting more computation into the devices could come at the expense of achievable throughput and also increase latency if not done carefully.
Limited Memory and Bandwidth. Another problem is the low memory capacity of in-network hardware. High-bandwidth memory on programmable network devices is scarce. Both FPGA-based and programmable ASIC switches are limited to a few tens of MB of on-chip SRAM. Modern AMD/Xilinx FPGAs provide a larger on-chip memory, URAM, in addition to BRAM, but the total on-chip memory is still in the range of a few hundred Mb. FPGA-based devices can be configured to access on-board DRAM. However, DRAM access comes at the cost of much lower bandwidth. In addition, a few gigabytes is not much in the age of petabytes of data on servers. Therefore, in addition to a well-optimized implementation of the functionality, caching techniques are required to make efficient use of the available on-chip and on-board memory.
Abstraction and Orchestration. Benson [41] identified management challenges associated with INC. One of the challenges identified is the need for an orchestrator capable of placing functionality into the network. In order for the orchestrator to make the right decisions about what functionality to place where and when, an abstraction of the network topology is required that takes into account the available resources and capabilities of the network devices, as well as the current states of each device, e.g., network load, faults, and so on. As stated in [114], the placement of algorithms, especially in complex networks, can no longer be done by network operators alone, but requires the support of an automated orchestrator.
Multi-Tenancy and Virtualization. To facilitate hardware abstraction for orchestration, virtualization techniques will be essential to provide a more unified view of all available nodes in the network [69]. To better utilize available resources and provide high performance to multiple clients, it is necessary to be able to run different tasks simultaneously on the same device. Such virtualization techniques must provide isolation-related features, both in terms of security and performance, while allowing resource sharing and not introducing large overheads [64, 122].
Encryption/Security. Network traffic encryption, while necessary for security, hinders INC, since computations cannot be performed on encrypted packets, and traditional key-sharing encryption is not practical for INC. Mafioletti et al. [87] demonstrated how P4-programmable INC devices can be exploited for attacks on Robot Operating System-based robotic systems. While solutions such as SROS2 [88] exist, they have limitations and are not directly applicable to INC solutions such as [79]. Homomorphic Encryption (HE) is a potential remedy, but current HE schemes are either computationally expensive or provide only limited security [33]. However, research in this area is ongoing and has made promising progress.
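To illustrate the property that makes HE attractive for INC, the following deliberately insecure C toy uses an additively homomorphic one-time pad modulo 2^64: an in-network node can add two ciphertexts without ever seeing the plaintexts, and only the endpoint holding the keys can decrypt the result. Real additively homomorphic schemes such as Paillier provide actual security, but at the far higher computational cost noted above:

#include <stdint.h>
#include <stdio.h>

/* Toy, completely INSECURE scheme: a one-time pad is additively
 * homomorphic modulo 2^64. Illustration of the property only. */
static uint64_t enc(uint64_t m, uint64_t k) { return m + k; }  /* wraps mod 2^64 */
static uint64_t dec(uint64_t c, uint64_t k) { return c - k; }

int main(void)
{
    uint64_t k1 = 0x1234abcd5678ef01ULL, k2 = 0x0fedcba987654321ULL;
    uint64_t c1 = enc(41, k1), c2 = enc(1, k2);

    uint64_t c_sum = c1 + c2;            /* in-network node: ciphertexts only */

    /* The endpoint knowing both keys recovers the plaintext sum: prints 42. */
    printf("%llu\n", (unsigned long long)dec(c_sum, k1 + k2));
    return 0;
}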
Time Constraints and Priority Handling. INC research often neglects priority handling and congestion control. In industrial scenarios, such as robot control sharing resources with other applications [79], prioritization is crucial. Adaptive decision-making is needed to process and route packets based on current and projected network load. This includes deciding when to perform INC at all, e.g., prioritizing high-priority packets during high load, forwarding low-priority packets unprocessed to avoid congestion, and choosing the nodes on which time-constrained packets are computed.
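A load-aware admission decision of the kind described could be sketched in C as follows (thresholds and metadata fields are our own illustration): deadline-bound flows are always computed in the network, while low-priority traffic is forwarded unprocessed once the load exceeds a threshold:

#include <stdint.h>

enum verdict { DO_INC, FORWARD_ONLY };

struct pkt_meta {
    uint8_t priority;       /* e.g., derived from DSCP bits */
    uint8_t time_critical;  /* deadline-bound traffic, e.g., robot control */
};

static enum verdict decide(const struct pkt_meta *m, uint32_t load_pct)
{
    if (m->time_critical)
        return DO_INC;               /* always compute deadline-bound flows */
    if (load_pct > 80 && m->priority < 4)
        return FORWARD_ONLY;         /* shed low-priority work under load */
    return DO_INC;
}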

11 Conclusion

In this work, we provide an overview of ongoing research in the area of SDN and INC, with a primary focus on how reconfigurable hardware can contribute to this area.
First, motivated by the advent of 6G technology and the IoE, we presented background information on SDN and INC. We introduced P4, the most popular data plane programming language today, and also discussed eBPF/XDP and DPDK, popular frameworks for packet-processing implementations. We categorized the architectures of SmartNICs and switches and argued that CGRAs could be a good alternative architecture in certain domains due to their faster reconfigurability and higher achievable frequency compared to FPGAs. Finally, we presented state-of-the-art research on architectures and application domains for accelerating applications in the network, discussed the open challenges, and outlined how reconfigurable hardware accelerators could contribute. With this survey, we hope to provide insight into this complex topic and to motivate the community to further research the use of reconfigurable hardware accelerators in the context of SDN and INC.

Acknowledgment

The authors acknowledge the financial support of the Federal Ministry of Education and Research of Germany within the programme “Souverän. Digital. Vernetzt.”

Footnotes

1
IPsec and the closely related TLS are protocols (or groups of protocols) for creating encrypted connections between devices over potentially unsafe networks such as the Internet. IPsec provides security at the network layer, while TLS operates at the transport layer and thereby secures application layer protocols such as HTTP.
2
Egress pipeline support can, however, be enabled by setting an additional compiler flag.
3
There are also several minor differences, such as the two-sided crossbar and the different scheduler design.
4
A generic memory for data sharing between the kernel and user space that can contain data of different types.
5
The standard ISA supports 1-, 2-, 4-, and 8-byte alignment only.
6
With the Alveo U50, the test setup consumes 20–25 W less power than with the BlueField-2 [97].

References

[1]
2014. Network Functions Virtualisation (NFV) - Network Operator Perspectives on Industry Progress. Technical Report. ETSI, SDN & OpenFlow World Congress, Düsseldorf, Germany.
[2]
AMD OpenNIC Project. 2023. Retrieved from https://github.com/Xilinx/open-nic
[3]
AMD Pensando™ Infrastructure Accelerators. 2023. Retrieved from https://www.amd.com/en/accelerators/pensando
[4]
BPF and XDP Reference Guide — Cilium 1.15.0-Dev Documentation. 2023. Retrieved from https://docs.cilium.io/en/latest/bpf/
[5]
DPDK. 2023. Retrieved from https://core.dpdk.org/doc/
[6]
eBPF - Introduction, Tutorials & Community Resources. 2023. Retrieved from https://ebpf.io
[7]
GitHub - NetFPGA/NetFPGA-PLUS. 2023. Retrieved from https://github.com/NetFPGA/NetFPGA-PLUS
[8]
Infrastructure Programmer Development Kit. 2023. Retrieved from https://ipdk.io/
[9]
Memcached - A Distributed Memory Object Caching System. 2023. Retrieved from https://memcached.org/
[10]
[12]
[13]
OpenCores: AES. 2023. Retrieved from https://opencores.org/projects/tiny_aes
[14]
OpenCores: SHA3 (KECCAK). 2023. Retrieved from https://opencores.org/projects/sha3
[15]
P4 DPDK Target Components. 2023. Retrieved from https://github.com/p4lang/p4-dpdk-target
[16]
The P4 Language Specification, Version 1.0.5. 2023. Retrieved from https://p4.org/p4-spec/p4-14/v1.0.5/tex/p4.pdf
[17]
P4 Portable NIC Architecture (PNA). 2023. Retrieved from https://p4.org/p4-spec/docs/PNA.html
[18]
P4₁₆ Language Specification. 2023. Retrieved from https://p4.org/wp-content/uploads/2022/07/P4-16-spec.html
[19]
P4₁₆ Portable Switch Architecture (PSA). 2023. Retrieved from https://p4.org/p4-spec/docs/PSA.html
[20]
p4c/backends/ebpf at main · p4lang/p4c. 2023. Retrieved from https://github.com/p4lang/p4c/tree/main/backends/ebpf
[21]
Programmer’s Guide — Data Plane Development Kit. 2023. Retrieved from https://doc.dpdk.org/guides/prog_guide/
[22]
PyTorch. 2023. Retrieved from https://www.pytorch.org
[24]
SDNet PX Programming Language User Guide (UG1016). 2023. Retrieved from https://docs.xilinx.com/v/u/2017.3-English/ug1016-px-programming
[25]
Snort - Network Intrusion Detection & Prevention System. 2023. Retrieved from https://www.snort.org/
[26]
SpinalHDL/VexRiscv. 2023. Retrieved from https://github.com/SpinalHDL/VexRiscv
[27]
T4P4S, a Multitarget P4₁₆ Compiler. 2023. Retrieved from https://github.com/P4ELTE/t4p4s
[28]
TensorFlow. 2023. Retrieved from https://www.tensorflow.org/
[29]
Terminology - SmartNICs Summit. 2023. Retrieved from https://smartnicssummit.com/terminology/
[33]
Abbas Acar, Hidayet Aksu, A. Selcuk Uluagac, Mauro Conti. 2018. A Survey on Homomorphic Encryption Schemes: Theory and Implementation. ACM Computing Surveys 51, 4 (2018), 1–35.
[34]
Anurag Agrawal and Changhoon Kim. 2020. Intel Tofino2 – A 12.9 Tbps P4-Programmable Ethernet Switch. In 2020 IEEE Hot Chips 32 Symposium (HCS). IEEE, Palo Alto, CA, 1–32.
[35]
Iqbal Alam, Kashif Sharif, Fan Li, Zohaib Latif, M. M. Karim, Sujit Biswas, Boubakr Nour, Yu Wang. 2021. A Survey of Network Virtualization Techniques for Internet of Things Using SDN and NFV. Computing Surveys 53, 2 (Mar. 2021), 1–40.
[36]
Gianni Antichi, Muhammad Shahbaz, Yilong Geng, Noa Zilberman, et al. 2014. OSNT: Open Source Network Tester. IEEE Network 28, 5 (Sept. 2014), 6–12.
[37]
Mohammad Babar, Muhammad Sohail Khan, Farman Ali, Muhammad Imran, Muhammad Shoaib. 2021. Cloudlet Computing: Recent Advances, Taxonomy, and Challenges. IEEE Access 9 (2021), 29609–29622.
[38]
Fetia Bannour, S. Souihi, and A. Mellouk. 2018. Distributed SDN Control: Survey, Taxonomy, and Challenges. IEEE Communications Surveys & Tutorials 20, 1 (2018), 333–354.
[39]
Luca Barsellotti, F. Alhamed, J. J. V. Olmos, F. Paolucci, P. Castoldi, and F. Cugini. 2022. Introducing Data Processing Units (DPU) at the Edge [Invited]. In 2022 International Conference on Computer Communications and Networks (ICCCN), 1–6.
[40]
Daniel U. Becker. 2012. Efficient Microarchitecture for Network-on-Chip Routers. Stanford University.
[41]
Theophilus A. Benson. 2019. In-Network Compute: Considered Armed and Dangerous. In the Workshop on Hot Topics in Operating Systems. ACM, Bertinoro Italy, 216–224.
[42]
Marco Bonola, G. Belocchi, A. Tulumello, M. S. Brunella, G. Siracusano, G. Bianchi, and R. Bifulco. 2022. Faster Software Packet Processing on FPGA NICs with eBPF Program Warping. In 2022 USENIX Annual Technical Conference (USENIX ATC ’22), 987–1004.
[43]
Pat Bosshart, Glen Gibb, Hun-Seok Kim, George Varghese, Nick McKeown, Martin Izzard, Fernando Mujica, Mark Horowitz. 2013. Forwarding Metamorphosis: Fast Programmable Match-Action Processing in Hardware for SDN. ACM SIGCOMM Computer Communication Review 43, 4 (2013), 99–110.
[44]
Pat Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, and D. Talayco. 2014. P4: Programming Protocol-Independent Packet Processors. ACM SIGCOMM Computer Communication Review 44, 3 (Jul. 2014), 87–95.
[45]
ONF. 2014. OpenFlow-Enabled SDN and Network Functions Virtualization. Open Networking Foundation 17 (2014), 1–12.
[46]
Marco Spaziani Brunella, G. Belocchi, and M. Bonola. 2022. hXDP: Efficient Software Packet Processing on FPGA NICs. Communications of the ACM 65, 8 (Aug. 2022), 92–100.
[47]
Miguel Castro and Barbara Liskov. 1999. Practical Byzantine Fault Tolerance. OsDI 99, 1999 (1999), 173–186.
[48]
D. C. Chen and J. M. Rabaey. 1992. A Reconfigurable Multiprocessor IC for Rapid Prototyping of Algorithmic-Specific High-Speed DSP Data Paths. IEEE Journal of Solid-State Circuits 27, 12 (Dec. 1992), 1895–1904.
[49]
S. Alexander Chin, K. P. Niu, M. Walker, S. Yin, A. Mertens, J. Lee, and J. H. Anderson. 2018. Architecture Exploration of Standard-Cell and FPGA-Overlay CGRAs Using the Open-Source CGRA-ME Framework. In the 2018 International Symposium on Physical Design. ACM, Monterey CA, 48–55.
[50]
Douglas Comer and Adib Rastegarnia. 2019. Toward Disaggregating the SDN Control Plane. IEEE Communications Magazine 57, 10 (Oct. 2019), 70–75.
[51]
Ryan A. Cooke and Suhaib A. Fahmy. 2020. A Model for Distributed In-Network and Near-Edge Computing with Heterogeneous Hardware. Future Generation Computer Systems 105 (Apr. 2020), 395–409.
[52]
Daniele De Sensi, S. Di Girolamo, S. Ashkboos, S. Li, and T. Hoefler. 2021. Flare: Flexible in-Network Allreduce. In the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, St. Louis Missouri, 1–16.
[53]
Salvatore Di Girolamo, A. Kurth, and A. Calotoiu. 2021. A RISC-V in-Network Accelerator for Flexible High-Performance Low-Power Packet Processing. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 958–971.
[54]
Ankur Dumka. 2018. Innovations in Software-Defined Networking and Network Functions Virtualization. IGI Global.
[55]
Danijela Efnusheva, G. Dokoski, A. Tentov, and M. Kalendar. 2016. Memory-Centric Approach of Network Processing in a Modified RISC-Based Processing Core. In 2016 Future Technologies Conference (FTC). IEEE, San Francisco, CA, 1181–1188.
[56]
K. Egevang and P. Francis. 1994. RFC1631: The IP Network Address Translator (NAT). RFC Editor.
[57]
Brad Fitzpatrick. 2004. Distributed Caching with Memcached. Linux Journal 2004, 124 (Aug. 2004), 5.
[58]
Alex Forencich, A. C. Snoeren, G. Porter, and G. Papen. 2020. Corundum: An Open-Source 100-Gbps NIC. In 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, Fayetteville, AR, 38–46.
[59]
Michael Galles and Francis Matus. 2021. Pensando Distributed Services Architecture. IEEE Micro 41, 2 (Mar. 2021), 43–49.
[60]
Michael Brian Galles, J. Bradley Smith, and Hemant Vinchure. 2022. Programmable Computer IO Device Interface. Patent No. US11263158B2, Filed Feb. 19th, 2019, Issued Mar. 1st., 2022.
[61]
D. Imbs, M. Raynal, and J. Stainer. 2016. Are byzantine failures really different from crash failures? In International Symposium on Distributed Computing (DISC), Lecture Notes in Computer Science (LNTCS), Vol. 9888, 215–229.
[62]
Ali Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. 2011. Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. In 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’11), 323–336.
[63]
Zhiyuan Guo, Y. Shan, X. Luo, Y. Huang, and Y. Zhang. 2022. Clio: A Hardware-Software Co-Designed Disaggregated Memory System. In the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, Lausanne Switzerland, 417–433.
[64]
Sol Han, S. Jang, H. Choi, H. Lee, and S. Pack. 2020. Virtualization in Programmable Data Plane: A Survey and Open Challenges. IEEE Open Journal of the Communications Society 1 (2020), 527–534.
[65]
Zijun Hang, Y. Wang, and S. Huang. 2021. P4 Transformer: Towards Unified Programming for the Data Plane of Software Defined Network. In 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC), 544–551.
[66]
R. W. Hartenstein, A. G. Hirschbiel, M. Riedmuller, K. Schmidt, and M. Weber. 1991. A Novel ASIC Design Approach Based on a New Machine Paradigm. IEEE Journal of Solid-State Circuits 26, 7 (Jul. 1991), 975–989.
[67]
Frederik Hauser, M. Häberle, D. Merling, and S. Lindner. 2023. A Survey on Data Plane Programming with P4: Fundamentals, Advances, and Applied Research. Journal of Network and Computer Applications 212 (2023), 103561.
[68]
Toke Høiland-Jørgensen, Jesper Dangaard Brouer, Daniel Borkmann, et al. 2018. The eXpress Data Path: Fast Programmable Packet Processing in the Operating System Kernel. In the 14th International Conference on Emerging Networking Experiments and Technologies. ACM, Heraklion, Greece, 54–66.
[69]
Ning Hu, Z. Tian, X. Du, and M. Guizani. 2021. An Energy-Efficient In-Network Computing Paradigm for 6G. IEEE Transactions on Green Communications and Networking 5, 4 (Dec. 2021), 1722–1733.
[70]
Stephen Ibanez, G. Brebner, N. McKeown, and N. Zilberman. 2019. The P4→NetFPGA Workflow for Line-Rate Packet Processing. In the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, Seaside, CA, 1–9.
[71]
Bassey Isong, R. R. S. Molose, A. M. Abu-Mahfouz, and N. Dladlu. 2020. Comprehensive Review of SDN Controller Placement Strategies. IEEE Access 8 (2020), 170070–170092.
[72]
Abhishek Kumar Jain, D. L. Maskell, and S. A. Fahmy. 2022. Coarse Grained FPGA Overlay for Rapid Just-in-Time Accelerator Compilation. IEEE Transactions on Parallel and Distributed Systems 33, 6 (Jun. 2022), 1478–1490.
[73]
Wei Jiang, B. Han, M. A. Habibi, and H. D. Schotten. 2021. The Road Towards 6G: A Comprehensive Survey. IEEE Open Journal of the Communications Society 2 (2021), 334–366.
[74]
Anuj Kalia, M. Kaminsky, and D. G. Andersen. 2014. Using RDMA Efficiently for Key-Value Services. In the 2014 ACM Conference on SIGCOMM, 295–306.
[75]
Elie Kfoury, J. Crichigno, and E. Bou-Harb. 2021. An Exhaustive Survey on P4 Programmable Data Plane Switches: Taxonomy, Applications, Challenges, and Future Trends. IEEE Access 9 (Feb. 2021), 87094–87155.
[76]
Sajad Khorsandroo, A. G. Sánchez, A. S. Tosun, J. M. Arco, and R. Doriguzzi-Corin. 2021. Hybrid SDN Evolution: A Comprehensive Survey of the State-of-the-Art. Computer Networks 192 (Jun. 2021), 107981.
[77]
Somayeh Kianpisheh and Tarik Taleb. 2023. A Survey on In-Network Computing: Programmable Data Plane and Technology Specific Applications. IEEE Communications Surveys & Tutorials 25, 1 (2023), 701–761.
[78]
David Koeplinger, M. Feldman, R. Prabhakar, Y. Zhang, S. Hadjis, R. Fiszel, T. Zhao, and L. Nardi. 2018. Spatial: A Language and Compiler for Application Accelerators. In the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, Philadelphia, PA, 296–311.
[79]
Sándor Laki, C. Györgyi, J. Pető, P. Vörös, and G. Szabó. 2022. In-Network Velocity Control of Industrial Robot Arms. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’22), 995–1009.
[80]
Chris Lattner, M. Amini, U. Bondhugula, A. Cohen, A. Davis, J. Pienaar, R. Riddle, and T. Shpeisman. 2021. MLIR: Scaling Compiler Infrastructure for Domain Specific Computation. In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2–14.
[81]
Jiaxin Lin, K. Patel, B. E. Stephens, A. Sivaraman, and A. Akella. 2020. PANIC: A High-Performance Programmable NIC for Multi-Tenant Networks. In USENIX Symposium on Operating Systems Design and Implementation (OSDI ’20), 243–259.
[82]
Will Lin, Y. Shan, R. Kosta, A. Krishnamurthy, and Y. Zhang. 2024. SuperNIC: An FPGA-Based, Cloud-Oriented SmartNIC. In the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, Monterey, CA, 130–141.
[83]
Leibo Liu, J. Zhu, Z. Li, Y. Lu, Y. Deng, J. Han, S. Yin, and S. Wei. 2020. A Survey of Coarse-Grained Reconfigurable Architecture and Design: Taxonomy, Challenges, and Applications. Computing Surveys 52, 6 (Nov. 2020), 1–39.
[84]
Ming Liu, Simon Peter, Arvind Krishnamurthy, and Phitchaya Mangpo Phothilimthana. 2019. E3: Energy-Efficient Microservices on SmartNIC-Accelerated Servers. In 2019 USENIX Annual Technical Conference (USENIX ATC ’19), 363–378.
[85]
Shuo Liu, Q. Wang, J. Zhang, W. Wu, Q. Lin, Y. Liu, M. Xu, M. Canini, R. C. C. Cheung, and J. He. 2023. In-Network Aggregation with Transport Transparency for Distributed Training. In the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Vol. 3. ACM, 376–391.
[86]
Shaonan Ma, T. Ma, K. Chen, and Y. Wu. 2022. A Survey of Storage Systems in the RDMA Era. IEEE Transactions on Parallel and Distributed Systems 33, 12 (2022), 4395–4409.
[87]
Diego Rossi Mafioletti, R. C. de Mello, M. Ruffini, V. Frascolla, M. Martinello, and M. R. N. Ribeiro. 2021. Programmable Data Planes as the Next Frontier for Networked Robotics Security: A ROS Use Case. In 2021 17th International Conference on Network and Service Management (CNSM). 160–165.
[88]
Victor Mayoral-Vilches, R. White, and G. Caiazza. 2022. SROS2: Usable Cyber Security Tools for ROS 2. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Kyoto, Japan.
[89]
Oliver Michel, R. Bifulco, G. Retvari, and S. Schmid. 2022. The Programmable Data Plane: Abstractions, Architectures, Algorithms, and Applications. Computing Surveys 54, 4 (May 2022), 1–36.
[90]
Vaibhawa Mishra, Q. Chen, and G. Zervas. 2016. REoN: A Protocol for Reliable Software-Defined FPGA Partial Reconfiguration over Network. In 2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig). IEEE, 1–7.
[91]
Marie-José Montpetit. 2022. The Network as a Computer Board: Architecture Concepts for In-Network Computing in the 6G Era. In 2022 1st International Conference on 6G Networking (6GNet), 1–5.
[92]
Luigi Nardi, A. Souza, D. Koeplinger, and K. Olukotun. 2019. HyperMapper: A Practical Design Space Exploration Framework. In 2019 IEEE 27th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), 425–426.
[93]
Quoc-Viet Pham, F. Fang, V. N. Ha, M. J. Piran, M. Le, L. B. Le, W. J. Hwang, and Z. Ding. 2020. A Survey of Multi-Access Edge Computing in 5G and beyond: Fundamentals, Technology Integration, and State-of-the-Art. IEEE Access 8 (2020), 116974–117017.
[94]
Dan R. K. Ports and Jacob Nelson. 2019. When Should the Network Be the Computer? In the Workshop on Hot Topics in Operating Systems. ACM, Bertinoro, Italy, 209–215.
[95]
Raghu Prabhakar, Y. Zhang, D. Koeplinger, M. Feldman, T. Zhao, S. Hadjis, A. Pedram, and C. Kozyrakis. 2017. Plasticine: A Reconfigurable Architecture for Parallel Patterns. In the 44th Annual International Symposium on Computer Architecture. ACM, Toronto, ON, Canada, 389–402.
[96]
Ju Ren, D. Zhang, S. He, Y. Zhang, and T. Li. 2020. A Survey on End-Edge-Cloud Orchestrated Network Computing Paradigms: Transparent Computing, Mobile Edge Computing, Fog Computing, and Cloudlet. Computing Surveys 52, 6 (Nov. 2020), 1–36.
[97]
Alessandro Rivitti, Roberto Bifulco, Angelo Tulumello, Marco Bonola, and Salvatore Pontarelli. 2023. eHDL: Turning eBPF/XDP Programs into Hardware Designs for the NIC. In the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Vol. 3, 208–223.
[98]
Martin Roesch. 1999. Snort – Lightweight Intrusion Detection for Networks. LISA ’99: Proceedings of the 13th USENIX Conference on System Administration 1999, 229–238.
[99]
H. Sabireen and V. Neelanarayanan. 2021. A Review on Fog Computing: Architecture, Fog with IoT, Algorithms and Research Challenges. ICT Express 7, 2 (2021), 162–176.
[100]
Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan Ports, and Peter Richtarik. 2021. Scaling Distributed Machine Learning with In-Network Aggregation. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’21), 785–808.
[101]
Mateus Saquetti, Guilherme Bueno, Weverton Cordeiro, and Jose Rodrigo Azambuja. 2020. P4VBox: Enabling P4-Based Switch Virtualization. IEEE Communications Letters 24, 1 (Jan. 2020), 146–149.
[102]
Mateus Saquetti, Raphael M. Brum, Bruno Zatt, Samuel Pagliarini, Weverton Cordeiro, and Jose R. Azambuja. 2021. A Terabit Hybrid FPGA-ASIC Platform for Switch Virtualization. In 2021 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). IEEE, Tampa, FL, 73–78.
[103]
Husain Sharaf, Imtiaz Ahmad, and Tassos Dimitriou. 2022. Extended Berkeley Packet Filter: An Application Perspective. IEEE Access 10 (2022).
[104]
Satnam Singh and David J. Greaves. 2008. Kiwi: Synthesis of FPGA Circuits from Parallel Programs. In 2008 16th International Symposium on Field-Programmable Custom Computing Machines, 3–12.
[105]
Man-Kit Sit, Manuel Bravo, and Zsolt István. 2021. An Experimental Framework for Improving the Performance of BFT Consensus for Future Permissioned Blockchains. In the 15th ACM International Conference on Distributed and Event-Based Systems (DEBS), 55–65.
[106]
Anirudh Sivaraman, Suvinay Subramanian, Mohammad Alizadeh, Sharad Chole, Shang-Tse Chuang, Anurag Agrawal, Hari Balakrishnan, Tom Edsall, Sachin Katti, and Nick McKeown. 2016. Programmable Packet Scheduling at Line Rate. In the 2016 ACM SIGCOMM Conference. ACM, Florianopolis, Brazil, 44–57.
[107]
Haoyu Song. 2013. Protocol-Oblivious Forwarding: Unleash the Power of SDN through a Future-Proof Forwarding Plane. In the 2nd ACM SIGCOMM Workshop on Hot Topics in Software Defined Networking - HotSDN ’13.
[108]
Brent E. Stephens, Darius Grassi, Hamidreza Almasi, Tao Ji, Balajee Vamanan, and Aditya Akella. 2021. TCP Is Harmful to In-Network Computing: Designing a Message Transport Protocol (MTP). In the 20th ACM Workshop on Hot Topics in Networks (HotNets), 61–68.
[109]
Radostin Stoyanov and Noa Zilberman. 2020. MTPSA: Multi-Tenant Programmable Switches. In the 3rd P4 Workshop in Europe. ACM, Barcelona, Spain, 43–48.
[110]
Henning Stubbe. 2017. P4 Compiler & Interpreter: A Survey. IITM 47 (2017), 47–52.
[111]
Nik Sultana, S Galea, David J. Greaves, and Marcin Wojcik. 2017. Emu: Rapid Prototyping of Networking Services. In the 2017 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC ’17), 459–471.
[112]
Naru Sundar, Brad Burres, Yadong Li, Dave Minturn, Brian Johnson, and Nupur Jain. 2023. 9.4 An In-depth Look at the Intel IPU E2000. In 2023 IEEE International Solid-State Circuits Conference (IEEE ISSCC), 162–164.
[113]
Tushar Swamy, Alexander Rucker, Muhammad Shahbaz, Ishan Gaur, and Kunle Olukotun. 2022. Taurus: A Data Plane Architecture for Per-Packet ML. In the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1099–1114.
[114]
Tushar Swamy, Annus Zulfiqar, Luigi Nardi, Muhammad Shahbaz, and Kunle Olukotun. 2023. Homunculus: Auto-Generating Efficient Data-Plane ML Pipelines for Datacenter Networks. In the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Vol. 3, 329–342.
[115]
Cheng Tan, Nicolas Bohm Agostini, Jeff Zhang, Marco Minutoli, Vito Giovanni Castellana, and Chenhao Xie. 2021. OpenCGRA: Democratizing Coarse-Grained Reconfigurable Arrays. In 2021 IEEE 32nd International Conference on Application-Specific Systems, Architectures and Processors (ASAP), 149–155.
[116]
D. L. Tennenhouse, J. M. Smith, W. D. Sincoskie, D. J. Wetherall, and G. J. Minden. 1997. A Survey of Active Network Research. IEEE Communications Magazine 35, 1 (Jan. 1997), 80–86.
[117]
Yuta Tokusashi, Hiroki Matsutani, and Noa Zilberman. 2018. LaKe: The Power of In-Network Computing. In 2018 International Conference on ReConFigurable Computing and FPGAs (ReConFig). IEEE, Cancun, Mexico, 1–8.
[118]
Shin-Yeh Tsai, Yizhou Shan, and Yiying Zhang. 2020. Disaggregating Persistent Memory and Controlling Them Remotely: An Exploration of Passive Disaggregated Key-Value Stores. In 2020 USENIX Annual Technical Conference (USENIX ATC ’20), 33–48.
[119]
Marcos A. M. Vieira, Matheus S. Castanho, Racyus D. G. Pacífico, Elerson R. S. Santos, and Eduardo P. M. Câmara Júnior. 2021. Fast Packet Processing with eBPF and XDP: Concepts, Code, Challenges, and Applications. Computing Surveys 53, 1 (Jan. 2021), 1–36.
[120]
Péter Vörös, Dániel Horpácsi, Róbert Kitlei, Dániel Leskó, Máté Tejfel, and Sándor Laki. 2018. T4P4S: A Target-Independent Compiler for Protocol-Independent Packet Processors. In 2018 IEEE 19th International Conference on High Performance Switching and Routing (HPSR), 1–8.
[121]
Han Wang, Robert Soulé, Huynh Tu Dang, Ki Suh Lee, Vishal Shrivastav, Nate Foster, and Hakim Weatherspoon. 2017. P4FPGA: A Rapid Prototyping Framework for P4. In the Symposium on SDN Research. ACM, Santa Clara, CA, 122–135.
[122]
Tao Wang, Hang Zhu, Fabian Ruffy, Xin Jin, Anirudh Sivaraman, Dan R. K. Ports, and Aurojit Panda. 2020. Multitenancy for Fast and Programmable Networks in the Cloud. In 12th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud ’20) USENIX Association, Article 10, 10.
[123]
Xiang Wang, Yang Hong, Harry Chang, KyoungSoo Park, Geoff Langdale, Jiayu Hu, and Heqing Zhu. 2019. Hyperscan: A Fast Multi-Pattern Regex Matcher for Modern CPUs. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’19), 631–648.
[124]
Jackson Woodruff, Murali Ramanujam, and Noa Zilberman. 2019. P4DNS: In-Network DNS. In 2019 ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ACM/IEEE ANCS), 1–6.
[125]
Junjie Xie, Deke Guo, Zhiyao Hu, Ting Qu, and Pin Lv. 2015. Control Plane of Software Defined Networks: A Survey. Computer Communications 67 (Aug. 2015), 1–10.
[126]
Junfeng Xie, F. Richard Yu, Tao Huang, Renchao Xie, Jiang Liu, and Chenmeng Wang. 2019. A Survey of Machine Learning Techniques Applied to Software Defined Networking (SDN): Research Issues and Challenges. IEEE Communications Surveys & Tutorials 21, 1 (2019), 393–430.
[127]
Fan Yang, Zhan Wang, Xiaoxiao Ma, Guojun Yuan, and Xuejun An. 2019. Understanding the Performance of In-Network Computing: A Case Study. In 2019 IEEE International Conference on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (IEEE ISPA/BDCloud/SocialCom/SustainCom), 26–35.
[128]
Bo Yi, Xingwei Wang, Keqin Li, Sajal K. Das, and Min Huang. 2018. A Comprehensive Survey of Network Function Virtualization. Computer Networks 133 (Mar. 2018), 212–262.
[129]
Rafael Zamacola, Andres Otero, Alberto Garcia, and Eduardo de la Torre. 2020. An Integrated Approach and Tool Support for the Design of FPGA-Based Multi-Grain Reconfigurable Systems. IEEE Access 8 (2020).
[130]
Rafael Zamacola, Andrés Otero, and Eduardo de la Torre. 2021. Multi-Grain Reconfigurable and Scalable Overlays for Hardware Accelerator Composition. Journal of Systems Architecture 121 (Dec. 2021).
[131]
Rafael Zamacola, Andrés Otero, Alfonso Rodríguez, and Eduardo de la Torre. 2022. Just-in-Time Composition of Reconfigurable Overlays. In the 13th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 11th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM ’22).
[132]
Zhipeng Zhao, Hugo Sadok, Nirav Atre, James C. Hoe, Vyas Sekar, and Justine Sherry. 2020. Achieving 100 Gbps Intrusion Prevention on a Single Server. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’20), 1083–1100.
[133]
Changgang Zheng, Zhaoqi Xiong, Thanh T. Bui, Siim Kaupmees, Riyad Bensoussane, Antoine Bernabeu, Shay Vargaftik, Yaniv Ben-Itzhak, and Noa Zilberman. 2024. IIsy: Hybrid In-Network Classification Using Programmable Switches. IEEE/ACM Transactions on Networking 32, 3 (June 2024), 2555–2570.
[134]
Changgang Zheng, Xinpeng Hong, Damu Ding, Shay Vargaftik, Yaniv Ben-Itzhak, and Noa Zilberman. 2023. In-Network Machine Learning Using Programmable Network Devices: A Survey. IEEE Communications Surveys & Tutorials 26, 2 (2023), 1171–1200.
[135]
Xiangfeng Zhu, Weixin Deng, Banruo Liu, Jingrong Chen, Yongji Wu, Thomas Anderson, Arvind Krishnamurthy, Ratul Mahajan, and Danyang Zhuo. 2023. Application Defined Networks. In the 22nd ACM Workshop on Hot Topics in Networks (HotNets), 87–94.
[136]
Noa Zilberman, Yury Audzevich, G. Adam Covington, and Andrew W. Moore. 2014. NetFPGA SUME: Toward 100 Gbps as Research Commodity. IEEE Micro 34, 5 (Sept. 2014), 32–41.
[137]
Noa Zilberman, Philip M. Watts, Charalampos Rotsos, and Andrew W. Moore. 2015. Reconfigurable Network Systems and Software-Defined Networking. Proceedings of the IEEE 103, 7 (Jul. 2015), 1102–1124.

Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 18, Issue 1, March 2025, 319 pages. EISSN: 1936-7414. DOI: 10.1145/3703028. Editor: Deming Chen.
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 25 December 2024
Online AM: 10 October 2024
Accepted: 14 September 2024
Revised: 16 July 2024
Received: 20 December 2023
Published in TRETS Volume 18, Issue 1

Author Tags

1. In-Network Computing
2. Software-Defined Networking
3. FPGA
4. CGRA
5. SoC

Funding Sources

• Joint project 6G-life
• German Research Foundation (DFG, Deutsche Forschungsgemeinschaft)