A Survey on Architectures, Hardware Acceleration and Challenges for In-Network Computing
Abstract
1 Introduction
2 Related Work and Motivation
3 Enabling Technologies
3.1 SDN
3.2 NFV
4 INC
5 Taxonomy
6 Programmable Switches and SmartNICs
7 Programmability of Network Devices
8 Architectures
8.1 GPPs
8.2 ASICs
8.3 FPGAs
8.4 SoC
Class | Manufacturer | Chip (Family) | Language/Framework | Architecture |
---|---|---|---|---|
RSoC | AMD/Xilinx | Alveo U25N | Vitis (HLS, Verilog, VHDL, P4) | XCU25N SoC with FPGA fabric and 4x Arm Cortex-A53 |
RSoC | Intel | N6000-PL (N6010/6011) | DPDK, FlexRAN (vRAN only), OPAE, OFS, Intel Quartus Prime Pro Edition | AGF014 SoC with FPGA fabric and 4x Arm Cortex-A53 |
MuSoC | Microsoft/Fungible | S1 | eBPF, C | 16x MIPS64 R6 cores |
MuSoC | Marvell/Cavium | OCTEON 10 | P4, eBPF, DPDK | 8–24x Arm Neoverse N2 |
MuSoC | NVIDIA/Mellanox | BlueField-2 | DOCA (SPDK, DPDK, P4, Netlink) | 8x Arm Cortex-A72 |
MuSoC | NVIDIA/Mellanox | BlueField-3 | DOCA (SPDK, DPDK, P4, Netlink) | 8x or 16x Arm Cortex-A78 |
MaSoC | AMD/Pensando | Capri | P4, DPDK | 4x Arm Cortex-A72, 112 MPUs\({}^{a}\) |
MaSoC | AMD/Pensando | Elba | P4, DPDK | 16x Arm Cortex-A72, 144 MPUs |
MaSoC | Netronome | NFP-4000 | P4, C, DPDK | Arm11 core \(+\) 4 FPC; 48 PPC (in- and egress); 5x 12 FPC\({}^{b}\) |
MaSoC | Netronome | NFP-6000 | P4, C, DPDK | Arm11 core \(+\) 4 FPC; 96 PPC (in- and egress); 10x 12 FPC |
MaSoC | Marvell/Cavium | OCTEON III (CN7890) | P4, eBPF, DPDK | 2x MIPS64 R5 \(+\) 48x cnMIPS64 v3 |
MaSoC | Microsoft/Fungible | F1 | eBPF, C | 4x MIPS64 (2x SMT) \(+\) 8x 6 MIPS64 (4x SMT) |
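Several of the SoC SmartNICs listed above expose their embedded general-purpose cores through kernel-bypass frameworks such as DPDK. As a rough illustration of that programming model (a generic sketch, not vendor-specific code; port, queue, and mempool initialization are deliberately omitted, and the port index 0 is an assumption), a minimal DPDK-style polling loop in C could look as follows.

```c
/* Minimal DPDK polling-loop sketch (illustrative only; assumes port 0 has
 * already been configured and started and that a packet mempool exists). */
#include <stdint.h>
#include <stdlib.h>
#include <rte_eal.h>
#include <rte_debug.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

int main(int argc, char **argv)
{
    /* Initialize the DPDK Environment Abstraction Layer. */
    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "EAL initialization failed\n");

    const uint16_t port_id = 0;          /* assumed, already-initialized port */
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Poll a burst of packets from RX queue 0 of the port. */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);
        if (nb_rx == 0)
            continue;

        /* Placeholder processing: bounce the packets back out the same port. */
        uint16_t nb_tx = rte_eth_tx_burst(port_id, 0, bufs, nb_rx);

        /* Free any mbufs the TX queue could not accept. */
        for (uint16_t i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(bufs[i]);
    }
    return 0;
}
```

On MuSoC/MaSoC devices, a loop of this kind would typically run on the embedded Arm or MIPS cores listed in the Architecture column, with heavier per-packet work steered to the devices' hardware offload engines.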
8.5 CGRA
9 Exploring FPGA-Based Designs for the PDP
Scope | Proposal | Year | Target Platform | Details |
---|---|---|---|---|
DNS | P4DNS [124] | 2019 | Switch (FPGA) | —P4 implementation of a DNS service. —Control plane communication via DMA. —Uses the P4 \(\rightarrow\) NetFPGA workflow. —Comparison with Emu based on the results reported in [111]. —Provides more features than Emu DNS. |
CP | Variant of PBFT [105] | 2020 | SmartNIC and FPGA | —Implementation and evaluation of different configurations for PBFT. —Study of SmartNIC offloading and of implementing PBFT completely on an FPGA. |
Caching/KVS | LaKe [117] | 2018 | Switch/NIC (FPGA) | —HW implementation based on the concept of the Memcached system. —Multilevel cache architecture using on-chip and on-board memory. —Comparison with Emu based on the results reported in [111]. —Slightly lower cache-hit latency compared to Emu. \(\quad\circ\) Emu does not have a cache hierarchy. |
Caching/KVS | Emu [111] | 2017 | Switch/NIC (CPU, FPGA) | —C# library for the Kiwi compiler. —C# code is transformed into Verilog code by Kiwi. —Additional comparison with NetFPGA and P4FPGA [121]. |
General/Multi-Domain | PANIC [81] | 2020 | SmartNIC (ASIC/FPGA) | —Heterogeneous CUs (accelerator or processor) \(\quad\rightarrow\) Support of hardware acceleration and software offloads. —CUs connected over a crossbar to a central hardware packet scheduler using a PIFO. —Uses Corundum's [58] NIC driver, DMA engine, MAC, and PHY. —FPGA prototype and ASIC implementation analysis. |
General/Multi-Domain | SuperNIC [82] | 2024 | SmartNIC (FPGA) | —Mapping of task DAGs to reconfigurable regions on the FPGA. —Reconfigurable regions/CUs are connected over a crossbar to a central scheduler. —A CU can contain a chain of multiple tasks; wrapper around each task for bypassing. —Subgraphs (virtual task chains) extracted from a tenant's DAG are scheduled to tasks in CUs. —Supports tenant CU sharing, replication, and virtual task chain parallelism. —Fairness mechanism based on space sharing using the DRF approach [62] and time sharing. |
General/Multi-Domain | MTPSA [109] | 2020 | Switch (CPU, FPGA) | —Proposes security isolation mechanisms for the PSA architecture. —Separation of superuser and user pipelines; the concept of roles is taken from OSs. —Proposes to encapsulate user pipelines in the superuser egress pipeline (between parser and MAT). —Encapsulated packets are decapsulated by the superuser pipeline and processed by the user pipeline. —User programs are opaque to the superuser and to other user (P4) programs. |
General/Multi-Domain | Terabit Switch Virtualization [102] | 2021 | Switch (ASIC-FPGA) | —Proposes a single-SoC solution with network switching logic as an ASIC and (embedded) FPGA logic for adaptable packet processing and switch virtualization. —Based on their P4VBox reference design [101]. —Analysis of the ASIC fabric and evaluation of parallel running network applications on the FPGA. |
eBPF/XDP Offload | hXDP [46] | 2020 | SmartNIC (ASIC/FPGA) | —Offloading of XDP programs to a SmartNIC. —Custom VLIW processor with accelerators. —Custom compiler optimizes the eBPF byte code for the offload engine. —Implemented on an FPGA, but fixed design \(\rightarrow\) can be implemented as an ASIC. |
eBPF/XDP Offload | hXDP \(+\) WE [42] | 2022 | SmartNIC (ASIC/FPGA) | —Extends hXDP with a MAT pipeline (Warp Engine (WE)) in front of the hXDP offload engine. —Custom compiler in front of the hXDP compiler: \(\quad\rightarrow\) Identifies parts to be offloaded to the MAT pipeline. \(\quad\rightarrow\) The rest is compiled and executed by the hXDP part. —Integration of hXDP (\(+\) Warp Engine) into Corundum and port to the Alveo U50. —Fixed latency of 28 clock cycles (112 ns @250 MHz) for the WE. |
eBPF/XDP Offload | eHDL [97] | 2023 | SmartNIC (FPGA) | —Generates a hardware pipeline based on the analysis of the eBPF byte code. \(\quad\circ\) The compiler translates eBPF byte code into VHDL code. —Unlike hXDP, resource utilization depends on the application (Figure 5). |
Machine Learning | Taurus [113] | 2022 | Switch (ASIC, CGRA) | —Proposes a CGRA-based switch architecture for ML. —Prototype uses a Tofino switch for the MAT pipeline and an FPGA for the CGRA implementation. —The FPGA is connected over Ethernet to the Tofino switch. —CGRA used for MapReduce \(\rightarrow\) can be bypassed for the normal PISA flow. —Training on the control plane and inference on the data plane. |
Machine Learning | Homunculus [114] | 2023 | Switch (ASIC, CGRA) | —Framework for mapping ML models to supported switch targets (Tofino, P4-SDNet, and Taurus). —Python front-end: provides functions that can be used to integrate existing libraries such as TensorFlow. —Middle-end: HyperMapper to optimize the configuration file. —Back-end: generates code (Spatial [78] and P4). |
Machine Learning | NetReduce [85] | 2023 | Switch (ASIC/FPGA) | —Accelerates in-network aggregation for ML. —Prototype: external FPGA attached to a commodity switch. —Discusses the implementation of the proposed design as an ASIC. —Discusses the limitations of (P4-)programmable switches. |
Security | Pigasus [132] | 2020 | SmartNIC (FPGA) | —Single-server IDS/IPS. —Parser, reassembler, and MSPM on the FPGA. —Regular expression and full match stages on the host. |
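The eBPF/XDP offload rows above (hXDP, hXDP \(+\) WE, and eHDL) all start from ordinary XDP programs and then execute or translate the resulting byte code in hardware. As a point of reference for what such an input program looks like, here is a minimal, generic XDP filter in C; it is not taken from any of the cited works, and the drop-all-UDP policy is purely illustrative.

```c
/* Minimal XDP program of the kind eBPF/XDP offload toolchains consume
 * (generic sketch, not code from the cited systems).
 * Typical libbpf build: clang -O2 -g -target bpf -c xdp_drop_udp.c -o xdp_drop_udp.o */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_drop_udp(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    /* Bounds checks are mandatory: the in-kernel verifier (and, by extension,
     * an offload compiler) rejects unchecked packet accesses. */
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;             /* only inspect IPv4 traffic */

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    if (ip->protocol == IPPROTO_UDP)
        return XDP_DROP;             /* illustrative policy: drop all UDP */

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```

The byte code produced from such a program is exactly the starting point summarized in the table: hXDP executes it on a custom VLIW soft processor, whereas eHDL translates it into a dedicated VHDL pipeline.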
Proposal | Evaluation Setup | Relevant Results |
---|---|---|
Emu DNS [111] | —Host: Intel Xeon E5-2637v4 @3.5 GHz, 64 GB DDR4-RAM \(\quad\circ\) NIC: Intel 82599ES (10 GbE) —NetFPGA-SUME: Emu —Traffic Source: OSNT [36] | —Throughput (Host): \(\color{green}\uparrow\) \(\approx 5.2x\) —Latency (Avg., Host): \(\color{green}\downarrow\) \(\approx 1/66.5x\) —Latency (\(99\)th-perc., Host): \(\color{green}\downarrow\) \(\approx 1/74.4x\) |
P4DNS [124] | —Host: Intel Xeon E5-2637v4 @3.5 GHz, 64 GB RAM \(\quad\circ\) NIC: Solarflare SFC9220 (10 GbE) \(\quad\circ\) Software: NSD [12] —NetFPGA-SUME: P4DNS —Traffic Source: OSNT [36] | —Throughput (NSD): \(\color{green}\uparrow\) \(\approx 52x\) —Throughput (Emu): \(\color{green}\uparrow\) \(\approx 10x\) —Latency (\(99\)th-perc., NSD): \(\color{green}\downarrow\) \(\approx 1/54.2x\) —Latency (\(99\)th-perc., Emu): \(\color{red}\uparrow\) \(\approx 1.8x\) —Latency (\(50\)th-perc., NSD): \(\color{green}\downarrow\) \(\approx 1/36.7x\) |
Variant of PBFT [105] | —Cluster: 24 machines, each with an Intel Xeon E-2186G \(\quad\circ\) 10 GbE network —Consensus group size: 15 nodes | —Accelerating data movement and hashing is more beneficial than crypto-only acceleration. —Optimal goodput depends on fine-granular batching. —Area efficiency (Throughput/Area, Intel): \(\quad\circ\) \(\color{green}\uparrow\) \(1.5x\) (Packet Filter) - \(248x\) (KV) |
Emu Memcached [111] | —Host: Intel Xeon E5-2637v4 @3.5 GHz, 64 GB DDR4 \(\quad\circ\) NIC: Intel 82599ES (10 GbE) —NetFPGA-SUME: Emu —Traffic Source: OSNT [36] | —Throughput (Host): \(\color{green}\uparrow\) \(\approx 2.2x\) —Latency (Avg., Host): \(\color{green}\downarrow\) \(\approx 1/20.1x\) —Latency (\(99\)th-perc., Host): \(\color{green}\downarrow\) \(\approx 1/22.7x\) |
LaKe [117] | —Host: Intel Core i7-4770, 64 GB RAM \(\quad\circ\) Software: Linux Memcached —NetFPGA-SUME: LaKe —Traffic Source: OSNT [36] | —Throughput (Host): \(\color{green}\uparrow\) \(\approx 13.6x\) —Throughput (Emu): \(\color{green}\uparrow\) \(\approx 6.8x\) —Latency (Cache Hit, Host): \(\color{green}\downarrow\) \(\approx 1/205x\) —Latency (Cache Hit, Emu): \(\color{red}\rightarrow\) \(\approx 1x\) —Latency (Cache Miss, Host): \(\color{green}\downarrow\) \(\approx 1/42x\) —Latency (Cache Hit vs. Miss, LaKe): \(\color{red}\uparrow\) \(\approx 4.6x\) |
PANIC [81] | —Server 1: Dell PowerEdge 640 \(\quad\circ\) NIC: NVIDIA/Mellanox ConnectX-5 (100 GbE) \(\quad\circ\) Data source: DPDK custom packets —Server 2: Dell PowerEdge 640 \(\quad\circ\) ADM-PCIE-9V3 (VU3P-2 FPGA): PANIC \(\quad\circ\) Uses open-source IPs as CUs: RISC-V @250 MHz, AES-256 @250 MHz, SHA-3 @150 MHz —ConnectX-5 directly connected to PANIC | —Achieved frequency: 250 MHz \(\quad\circ\) Exception: PIFO at 125 MHz —11.27% LUT utilization \(\quad\circ\) High utilization by the PIFO —8.94% BRAM utilization —Achieved throughput: 100 Gbps (linerate) \(\quad\circ\) Maximum on-chip bandwidth: 256 Gbps —Latency: scheduling-, load-, and CU-dependent \(\quad\circ\) \(\lt 16\,{\mu}\textrm{s}\) in evaluation |
SuperNIC [82] | —Evaluation using Verilator simulation (\(s\)) and testbed (\(t\)). —Testbed (KVS use case): cluster connected over a 100 GbE switch \(\quad\circ\) SuperNIC: HiTech Global HTG-9200 (AMD/Xilinx VU9P, 9x100 GbE) (\(A^{*}\)) \(\quad\circ\) 2 Servers: Dell PowerEdge R740 (Xeon Gold 5128) with NVIDIA/Mellanox ConnectX-4 NIC (100 GbE) (\(B\): Clover [118], \(C\): HERD [74]) and NVIDIA/Mellanox BlueField-Gen1 NIC (100 GbE) (\(D\): HERD [74]) \(\quad\circ\) AMD/Xilinx ZCU106 evaluation board (10 GbE): Clio [63] (\(E\)) —Re-implementation of PANIC [81] on the HTG-9200 as baseline. —Evaluation: chains with dummies (\(d\)) and chains consisting of tasks (\(r\)), including firewall, KV-cache, NAT, load balancing, forwarding, and AES. —Parallelism: \(S1\) \(=\) None, \(S2\) \(=\) DAG Parallel, \(S3\) \(=\) \(S2\) \(+\) Instance Parallel \({}^{*}\) \(A+E\) here with caching NT (also included in the paper without) | —Achieved frequency for most modules: 250 MHz —Minimum latency (ingress to egress): \(1.3{\mu}\textrm{s}\) —Beneficial to have short-running NTs together in a single CU. —Time sharing improves area utilization compared to DRF [62] only. —Achieved throughput: 100 Gbps (linerate) —Throughput (\(s/d\), \(S1\) vs. \(S2\)): \(\color{red}\rightarrow\) \(\approx 1x\) —Throughput (\(s/d\), \(S1/S2\) vs. \(S3\)): \(\color{green}\uparrow\) \(\approx 1x-1.5x\) —Latency (\(s/d\) and \(s/r\), PANIC): \(\color{green}\downarrow\) \(\approx 1x-0.6x\) —Throughput (\(s/r\), PANIC): \(\color{red}\rightarrow\) \(\approx 1x\) — \(A+E\) vs. \(E\): Lat. \(\color{green}\downarrow\) \(\approx 0.6x-0.8x\); Thp. \(\color{green}\uparrow\) \(\approx 1.2x-1.8x\) — \(A+E\) vs. \(B/C\): Lat. \(\color{green}\downarrow\) \(\approx 0.5x-0.6x\); Thp. \(\color{green}\uparrow\) \(\approx 1.2x-1.8x\) — \(A+E\) vs. \(D\): Lat. \(\color{green}\downarrow\) \(\approx 0.3x-0.4x\); Thp. \(\color{green}\uparrow\) \(\approx 2.9x-3.8x\) |
MTPSA [109] | —SW target: BMv2 (functional evaluation in simulation with Mininet). —HW target: NetFPGA SUME (L2 switch) \(\quad\circ\) Traffic source: OSNT [36] \(\quad\circ\) Packet sizes: 64B–1,518B (results reported only for 64B). —Comparison with the P4 \(\rightarrow\) NetFPGA PSA as reference design. —Comparison of MTPSA without user programs (MTPSA\({}_{0}\)) and with up to eight user programs (MTPSA\({}_{x}\), \(x\in\{2,3,4,8\}\)) | —Achieved the maximum throughput of the NetFPGA SUME (40 Gbps). —Latency (Ref. \(\rightarrow\) MTPSA\({}_{0}\)): \(\color{red}\uparrow\) \(1.7{\mu}\textrm{s}\rightarrow 2.52{\mu}\textrm{s}\) —Latency (MTPSA\({}_{0}\) \(\rightarrow\) MTPSA\({}_{x}\)): \(\color{red}\uparrow\) \(2.52{\mu}\textrm{s}\rightarrow 3.23{\mu}\textrm{s}-3.3{\mu}\textrm{s}\) —Relatively stable latency \(\rightarrow\) (mainly) user-program dependent. —Ref. \(\rightarrow\) MTPSA\({}_{0}\): Logic \(\color{red}\uparrow\) 10%; Memory \(\color{red}\uparrow\) \(7.65\%\) —MTPSA\({}_{0}\) \(\rightarrow\) MTPSA\({}_{x}\) (Logic, per program): \(\color{red}\uparrow\) \(5.9\%-7.4\%\) —MTPSA\({}_{0}\) \(\rightarrow\) MTPSA\({}_{x}\) (Memory, per program): \(\color{red}\uparrow\) \(5.4\%-6.3\%\) |
Tb Sw. Virtual. [102] | —ASIC logic (65 nm): Synopsys Design Compiler —FPGA logic: AMD/Xilinx XCVU13P \(\quad\circ\) Generated vSwitch instances with P4 \(\rightarrow\) NetFPGA. \(\quad\circ\) Use cases: L2 switch (26x), firewall (17x), router (14x), INT (14x) | —Frequency (ASIC): 1 GHz; Total area: \(47.6 \mathrm{mm}^{2}\); Power: \(28.3 \mathrm{W}\) —Frequency (FPGA): 718.4 MHz —Throughput (per instance): \(129.61-132.63\) Gbps —Possible throughput (total): \(\approx 1.43-3.45\) Tbps (Max: \(3.2\) Tbps) |
hXDP [46] | —Host: Intel Xeon E5-1630 v3 —Netronome NFP-4000 @800 MHz: XDP offload —NetFPGA-SUME: hXDP @156 MHz —Evaluation applications (firewall and Katran) and microbenchmarks. —Packet forwarding evaluation: 64B–1,518B | —\(\approx\) 18% logic resource utilization —Throughput (applications): \(\quad\circ\) Host @2.1 GHz: \(\color{green}\uparrow\) \(\approx 1.08x\) – \(\approx 1.55x\) \(\quad\circ\) Host @3.7 GHz: \(\color{red}\downarrow\) \(\approx 0.88x\) – \(\approx 0.62x\) —Forwarding latency (64B–1,518B): \(\quad\circ\) Host @3.7 GHz: \(\color{green}\downarrow\) \(\approx 1/8.3x\) – \(\approx 1/10.7x\) \(\quad\circ\) NFP-4000: \(\color{green}\downarrow\) \(\approx 1/1.1x\) – \(\approx 1/3.5x\) —hXDP has higher throughput for TX or redirection actions. |
hXDP \(+\) WE [42] | —AMD/Xilinx Alveo U50 @250 MHz: hXDP —AMD/Xilinx Alveo U50 @250 MHz: hXDP \(+\) WE | —Logic (hXDP): \(\color{red}\uparrow\) \(\approx\) 51.4% (\(\approx\) 13.3% total) —Memory (hXDP): \(\color{red}\uparrow\) \(\approx\) 43.5% (\(\approx\) 11.5% total)\({}^{a}\) —Instruction reduction (for hXDP): \(\color{green}\downarrow\) \(\approx\) 16.3%–100% —Throughput (hXDP only): \(\color{green}\uparrow\) \(\approx 1.2x-3.1x\) \(\quad\circ\) Complete offload of Suricata to the WE: \(\color{green}\uparrow\) \(\approx 18.2x\) —Latency (hXDP only): \(\color{red}\uparrow\) \(\approx 1.01x-1.1x\) \(\quad\circ\) Exception: Katran (\(\color{green}\downarrow\) \(\approx 0.98x\)) |
eHDL [97] | —AMD/Xilinx Alveo U50: eHDL —AMD/Xilinx Alveo U50: hXDP —AMD/Xilinx Alveo U50: P4-SDNet —NVIDIA/Mellanox BlueField-2 | —Throughput (SDNet)\({}^{b}\): \(\color{red}\rightarrow\) 1x —Throughput (hXDP): \(\color{green}\uparrow\) \(\approx 27.4x-164.4x\) —Throughput (BlueField-2, 4 cores)\({}^{c}\): \(\color{green}\uparrow\) \(\approx 11.7x-23.4x\) —Latency (Avg., hXDP): \(\color{red}\uparrow\) \(\approx 1.03x\) \(\quad\circ\) \(\color{green}\downarrow\) \(\approx 0.9x\) (Firewall) \(-\) \(\color{red}\uparrow\) \(\approx 1.2x\) (Router) —Latency\({}^{d}\) (BlueField-2): \(\color{green}\downarrow\) \(\approx 0.1x\) |
Taurus [113] | —2 Servers, each with an Intel Xeon Gold 6248 @2.5 GHz \(\quad\circ\) MoonGen: traffic generator and traffic sink. —Taurus switch prototype: \(\quad\circ\) Wedge 100BF-32X (Tofino switch) \(\quad\circ\) AMD/Xilinx Alveo U250 (CGRA implementation) —Case study: anomaly detection \(\quad\circ\) Inference on the CGRA \(\quad\circ\) Baseline inference on the control plane \(\quad\circ\) 5 Gbps traffic \(\quad\circ\) Sampling rate between 100 Kbps and 100 Mbps | —Estimated area overhead of \(3.8\%\) for an ASIC implementation —Latency: \(\quad\circ\) Baseline: 34 ms (100 Kbps) - 512 ms (100 Mbps) \(\quad\circ\) Taurus (Avg.): 122 ns —Detection rate: \(\quad\circ\) Baseline: 0% (100 Mbps) - \(\approx\) 2.6% (1 Mbps) \(\quad\circ\) Taurus: \(\color{green}\uparrow\) 58.2% (100 Mbps - 100 Kbps) —F1 score: \(\quad\circ\) Baseline: 0% (100 Mbps) - \(\approx\) 4.9% (1 Mbps) \(\quad\circ\) Taurus: \(\color{green}\uparrow\) 71.1% (100 Mbps - 100 Kbps) —Online training converges fastest with a higher sampling rate (\(10^{-2}\)), a higher number of epochs (10), and small batch sizes (64). |
Homunculus [114] | —2 Servers, each with an Intel Xeon Gold 6248 @2.5 GHz \(\quad\circ\) MoonGen: traffic generator and traffic sink. —Taurus switch prototype: \(\quad\circ\) Wedge 100BF-32X (Tofino switch) \(\quad\circ\) AMD/Xilinx Alveo U250 (CGRA implementation) —Case studies: anomaly detection, traffic classification, and botnet chatter detection. \(\quad\circ\) Ideal F1 score calculated offline in SW. | —Achieved line rate for all three applications. —Achieved the ideal F1 score for all three applications. |
NetReduce [85] | —Single-GPU: six machines, each with \(\quad\circ\) 2x Intel Xeon E5-2064 (10 cores @2.4 GHz) \(\quad\circ\) 3x 32 GB DDR4 \(\quad\circ\) NVIDIA GeForce RTX 2080 (8 GB) \(\quad\circ\) NIC: NVIDIA/Mellanox ConnectX-5 (100 GbE) —Multi-GPU: four machines, each with \(\quad\circ\) 2x Intel Xeon Gold 6154 (18 cores @3.00 GHz) \(\quad\circ\) 16x 64 GB DDR4 \(\quad\circ\) 8x NVIDIA Tesla V100 SXM2 (32 GB) \(\quad\circ\) NIC: NVIDIA/Mellanox ConnectX-5 (100 GbE) —Evaluation: \(\quad\circ\) Image classification (ImageNet dataset): AlexNet, VGG16, and ResNet50. \(\quad\circ\) NLP: BERT, GPT-2, MNLI, QNLI, QQP, and SQuAD \(\quad\circ\) Comparison with RAR, SwitchML, FAR, and TAR. | —Throughput (Single-GPU): \(\quad\circ\) ImageNet, RAR: \(\color{green}\uparrow\) \(\approx 1.05x\ -\approx 1.45x\) \(\quad\circ\) ImageNet, SwitchML: \(\color{green}\uparrow\) \(\approx 1x\ -\approx 1.25x\) \(\quad\circ\) NLP, RAR: \(\color{green}\uparrow\) \(\approx 1.22x\ -\approx 1.43x\) —Throughput (Multi-GPU): \(\quad\circ\) ImageNet, FAR: \(\color{green}\uparrow\) \(\approx 1.15x\ -\approx 1.69x\) \(\quad\circ\) ImageNet, TAR: \(\color{green}\uparrow\) \(\approx 1.12x\ -\approx 1.58x\) —Communication improvement (Multi-GPU\({}^{e}\)): \(\quad\circ\) ImageNet, RAR: \(\color{green}\uparrow\) \(\approx 1.16x\ -\approx 1.34x\) —Accuracy loss (Single-GPU): \(\quad\circ\) ImageNet, RAR: \(\color{red}\downarrow\) \(\approx 0.2\%\ -\approx 1.5\%\) |
Pigasus [132] | —Intel i7-4790 @3.60 GHz: Snort 3 SW only —Intel i9-9960X @3.1 GHz: Snort 3 SW \(\quad\circ\) Intel Stratix 10 MX: Pigasus | —Number of Snort cores for 100 Gbps: \(\quad\circ\) IDS: \(\color{green}\downarrow\) 23–185x less than SW only \(\quad\circ\) IPS: \(\color{green}\downarrow\) 23–200x less than SW only —Latency (SW only): \(\color{green}\downarrow\) \(1/3x\) \(-\) \(1/10x\) —Estimated power consumption: 49–166 W \(\quad\circ\) Compared to SW only: \(\color{green}\downarrow\) \(1/13x\) \(-\) \(1/59x\) |
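The relative factors in the table above are most naturally read as ratios against the listed baseline; this reading is an interpretation, since the table does not define the notation. Taking the Emu DNS row as an example:

```latex
% Interpretation of the relative factors (Emu DNS row as example);
% T = throughput, L = average latency, baseline = software host.
\[
  \frac{T_{\text{Emu}}}{T_{\text{Host}}} \approx 5.2
  \qquad\text{and}\qquad
  \frac{L_{\text{Emu}}}{L_{\text{Host}}} \approx \frac{1}{66.5},
\]
% so an entry written as "\(\downarrow \approx 1/66.5x\)" means the latency
% drops to roughly 1/66.5 of the baseline, while "\(\uparrow \approx 5.2x\)"
% denotes a 5.2-fold throughput improvement over it.
```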
9.1 SmartNIC and Switch Designs for Multi-Domain Applications on FPGAs
9.2 Investigation of SmartNIC Designs for eBPF/XDP Offloading to FPGAs
9.3 Proposals for Different Application Domains Realized on FPGAs
9.4 Lessons Learned and Open Questions
10 Additional Open Research Challenges
11 Conclusion
Acknowledgment
Footnotes
References