Accelerating networking with AF_XDP
The core idea behind the XDP initiative is to get the network stack out of the way as much as possible. While the network stack is highly flexible, XDP is built around a bare-bones packet transport that is as fast as it can be. When a decision needs to be made or a packet must be modified, XDP provides a hook for a user-supplied BPF program to do the work. The result combines minimal overhead with a great deal of flexibility, at the cost of a prominent "some assembly required" label on the relevant man pages. For users who count every nanosecond of packet-processing overhead (to the point that the 4.17 kernel will include some painstaking enhancements to the BPF JIT compiler that reduce the size of the generated code by 5%), figuring out how to put the pieces together is worth the trouble.
The earliest XDP work enabled the loading of a BPF program into the network interface device driver, with the initial use case being a program that dropped packets as quickly as possible. That may not be the most exciting application, but it is a useful feature for a site that is concerned about fending off distributed denial-of-service attacks. Since then, XDP has gained the ability to perform simple routing (retransmitting a packet out the same interface it arrived on) and, for some hardware, to offload the BPF program into the interface itself.
There are limits, though, to what can be done in the context of a network-interface driver; for such cases, AF_XDP is intended to connect the XDP path through to user space. It can be thought of as being similar to the AF_PACKET address family, in that it transports packets to and from an application with a minimum of processing, but this interface is clearly intended for applications that prioritize packet-processing performance above convenience. So, once again, some assembly is required in order to actually use it.
That assembly starts by calling socket() in the usual way with the AF_XDP address family; that yields a socket file descriptor that can (eventually) be used to move packets. First, however, it is necessary to create an array in user-space memory called a "UMEM". It is a chunk of contiguous memory, divided into equal-sized "frames" (the actual size is specified by the caller), each of which can hold a single packet. By itself, the UMEM looks rather boring.
After the memory has been allocated by the application, this array is registered with the socket using the XDP_UMEM_REG command of the setsockopt() system call.
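As a rough illustration, the sketch below creates the socket and registers a UMEM with it. It assumes the definitions from the patch set's proposed <linux/if_xdp.h> header; the structure fields and frame-size choices shown here are illustrative and may differ from what is eventually merged.

```c
/* Sketch only: assumes the proposed <linux/if_xdp.h> definitions; the
 * field names in struct xdp_umem_reg follow the patch set and may change. */
#include <linux/if_xdp.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <stdlib.h>
#include <unistd.h>

#define NUM_FRAMES  4096
#define FRAME_SIZE  2048	/* each frame holds one packet */

static int create_xsk_with_umem(void **umem_area)
{
	int xsk = socket(AF_XDP, SOCK_RAW, 0);
	if (xsk < 0)
		return -1;

	/* One contiguous region, divided into NUM_FRAMES equal frames */
	if (posix_memalign(umem_area, getpagesize(),
			   (size_t)NUM_FRAMES * FRAME_SIZE))
		return -1;

	struct xdp_umem_reg reg = {
		.addr = (__u64)(unsigned long)*umem_area,
		.len = (__u64)NUM_FRAMES * FRAME_SIZE,
		.frame_size = FRAME_SIZE,
		.frame_headroom = 0,
	};
	if (setsockopt(xsk, SOL_XDP, XDP_UMEM_REG, &reg, sizeof(reg)))
		return -1;
	return xsk;
}
```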
Each frame in the array has an integer index called a "descriptor". To use those descriptors, the application creates a circular buffer called the "fill queue", using the XDP_UMEM_FILL_QUEUE setsockopt() call. This queue can then be mapped into user-space memory using mmap(). The application can request that the kernel place an incoming packet into a specific frame in the UMEM array by adding that frame's descriptor to the fill queue.
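Concretely, the fill-queue setup might look something like the sketch below, continuing the example above. The ring layout (producer and consumer indices followed by an array of descriptors) and the XDP_UMEM_PGOFF_FILL_QUEUE mmap() offset are assumptions based on the description of the proposed interface.

```c
/* Sketch: create the fill queue, map it, and hand frames to the kernel.
 * The ring layout and the mmap offset constant are assumptions. */
#define FQ_NUM_DESCS 1024

struct umem_queue {			/* assumed ring layout */
	__u32 producer;
	__u32 consumer;
	__u32 desc[FQ_NUM_DESCS];	/* UMEM frame descriptors */
};

static struct umem_queue *setup_fill_queue(int xsk)
{
	int descs = FQ_NUM_DESCS;

	if (setsockopt(xsk, SOL_XDP, XDP_UMEM_FILL_QUEUE,
		       &descs, sizeof(descs)))
		return NULL;

	struct umem_queue *fq = mmap(NULL, sizeof(*fq),
				     PROT_READ | PROT_WRITE, MAP_SHARED,
				     xsk, XDP_UMEM_PGOFF_FILL_QUEUE);
	if (fq == MAP_FAILED)
		return NULL;

	/* Give the first 64 frames to the kernel for incoming packets */
	for (__u32 i = 0; i < 64; i++)
		fq->desc[fq->producer++ & (FQ_NUM_DESCS - 1)] = i;

	return fq;
}
```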
Once a descriptor goes into the fill queue, the kernel owns it (and the associated UMEM frame). Getting that descriptor back (with a new packet in the associated frame) requires creating yet another queue (the "receive queue"), with the XDP_RX_QUEUE setsockopt() operation. It, too, is a circular buffer that must be mapped into user space; once a frame has been filled with a packet, its descriptor will be moved to the receive queue. A call to poll() can be used to wait for packets to arrive in the receive queue.
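A receive path built on that interface might then look like the sketch below. The receive queue itself is created the same way as the fill queue (an XDP_RX_QUEUE setsockopt() followed by mmap()), and the entry layout shown here, a frame descriptor plus a packet length, is again an assumption.

```c
/* Sketch of the receive side; the entry layout is assumed. */
#include <poll.h>

#define RQ_NUM_DESCS 1024

struct rx_entry {
	__u32 idx;	/* UMEM frame descriptor */
	__u32 len;	/* length of the received packet */
};

struct rx_queue {	/* assumed ring layout, as before */
	__u32 producer;
	__u32 consumer;
	struct rx_entry desc[RQ_NUM_DESCS];
};

static void rx_loop(int xsk, struct rx_queue *rq,
		    struct umem_queue *fq, char *umem_area)
{
	struct pollfd pfd = { .fd = xsk, .events = POLLIN };

	for (;;) {
		poll(&pfd, 1, -1);	/* wait for packets to arrive */

		while (rq->consumer != rq->producer) {
			struct rx_entry *e =
				&rq->desc[rq->consumer++ & (RQ_NUM_DESCS - 1)];
			char *pkt = umem_area + (__u64)e->idx * FRAME_SIZE;

			/* ... process e->len bytes of packet data at pkt ... */

			/* Return the frame to the kernel for reuse */
			fq->desc[fq->producer++ & (FQ_NUM_DESCS - 1)] = e->idx;
		}
	}
}
```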
A similar story exists on the transmit side. The application creates a transmit queue with XDP_TX_QUEUE and maps it; a packet is transmitted by placing its descriptor into that queue. A call to sendmsg() informs the kernel that one or more descriptors are ready for transmission. The completion queue (created with XDP_UMEM_COMPLETION_QUEUE) receives descriptors from the kernel after the packets they contain have been transmitted. The full picture thus involves four queues in all: fill, receive, transmit, and completion.
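The transmit side, sketched below with the same assumed ring and entry layout, just fills in a descriptor and pokes the kernel with an empty sendmsg(); the frame index comes back on the completion queue once the packet has left.

```c
/* Sketch of the transmit side; ring and entry layouts are assumed. */
#define TQ_NUM_DESCS 1024

struct tx_entry {
	__u32 idx;	/* UMEM frame holding the outgoing packet */
	__u32 len;	/* packet length */
};

struct tx_queue {
	__u32 producer;
	__u32 consumer;
	struct tx_entry desc[TQ_NUM_DESCS];
};

static void tx_one(int xsk, struct tx_queue *txq, __u32 frame_idx, __u32 len)
{
	struct tx_entry *e = &txq->desc[txq->producer++ & (TQ_NUM_DESCS - 1)];

	e->idx = frame_idx;
	e->len = len;

	/* Tell the kernel there is something to send; the descriptor will
	 * reappear on the completion queue after transmission, at which
	 * point the frame can be reused. */
	struct msghdr msg = { 0 };
	sendmsg(xsk, &msg, 0);
}
```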
This whole data structure is designed to enable zero-copy movement of packet data between user space and the kernel, though the current patches do not yet implement that. It also allows received packets to be retransmitted without copying them, since any descriptor can be used for either transmission or reception.
The UMEM array can be shared between multiple processes. If a process wants to create an AF_XDP socket attached to an existing UMEM, it simply passes its socket file descriptor and the file descriptor associated with the socket owning the UMEM to bind(); the second file descriptor is passed in the sockaddr_xdp structure. There is only one fill queue and one completion queue associated with the UMEM regardless of how many processes are using it, but each process must maintain its own transmit and receive queues. In other words, in a multi-process configuration, it is expected that one process (or thread) will be dedicated to the management of the UMEM frames, while each of the others takes on one aspect of the packet-handling task.
There is one other little twist here, relating to how the kernel chooses a receive queue for any given incoming packet. There are two pieces to that puzzle, the first of which is yet another new BPF map type called BPF_MAP_TYPE_XSKMAP. This map is a simple array, each entry of which can contain a file descriptor corresponding to an AF_XDP socket. A process that is attached to the UMEM can call bpf() to store its file descriptor in the map; what is actually stored is an internal kernel pointer, of course, but applications won't see that. The other piece is a BPF program loaded into the driver whose job is to classify incoming packets and direct them to one of the entries in the map; that will cause the packets to show up in the receive queue corresponding to the AF_XDP socket in the chosen map entry.
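The classification side could look something like the following XDP program (built with clang's BPF target). bpf_redirect_map() is the helper used for this kind of map-based redirection; the map size, section names, and the use of the hardware queue index as the map key are illustrative choices rather than anything mandated by the interface.

```c
/* Sketch of an XDP program steering packets into AF_XDP sockets via an
 * XSKMAP; the map size and section names are illustrative. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_XSKMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, __u32);	/* an AF_XDP socket file descriptor */
} xsks_map SEC(".maps");

SEC("xdp")
int redirect_to_xsk(struct xdp_md *ctx)
{
	/* Redirect to the socket stored at this hardware queue's index
	 * in the map. */
	return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, 0);
}

char _license[] SEC("license") = "GPL";
```

On the user-space side, a process would store its socket in the map with something like bpf_map_update_elem(map_fd, &queue_id, &xsk_fd, 0).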
Without the map and BPF program, an AF_XDP socket is unable to receive packets. You were warned that some assembly was required.
The final piece is a bind() call to attach the socket to a specific network interface and, probably, a specific hardware queue within that interface. The interface itself can then be configured to direct packets to that queue if they should be handled by the program behind the AF_XDP socket.
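A sketch of that final step appears below, assuming the sockaddr_xdp layout described with the patch set. A process attaching to another socket's UMEM would additionally pass that socket's file descriptor in the structure (and set the corresponding shared-UMEM flag), as described above.

```c
/* Sketch: bind the socket to one interface and one hardware queue.
 * Field names in sockaddr_xdp follow the proposed interface. */
#include <net/if.h>

static int bind_xsk(int xsk, const char *ifname, __u32 queue_id)
{
	struct sockaddr_xdp sxdp = {
		.sxdp_family   = AF_XDP,
		.sxdp_ifindex  = if_nametoindex(ifname),
		.sxdp_queue_id = queue_id,
	};

	return bind(xsk, (struct sockaddr *)&sxdp, sizeof(sxdp));
}
```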
The intended final result is a structure that can enable user-space code to perform highly efficient packet management, with as much hardware support as possible and with a minimum of overhead in the kernel. There are some other pieces that are needed to get there, though. The zero-copy code is clearly one of them; copying packet data between the kernel and user space is fatal in any sort of high-performance scenario. Another one is the XDP redirect patch set being developed by Jesper Dangaard Brouer; that functionality is what will allow an XDP program to direct packets toward specific AF_XDP sockets. Driver support is also required; that is headed toward mainline for a couple of Intel network interfaces now.
If it all works as planned, it should become possible to process packets at a much higher rate than has been possible with the mainline network stack so far. This functionality is not something that many developers will feel driven to use, but it is intended to be appealing to those who have resorted to user-space stacks in the past. It is a sign of an interesting direction that kernel development has taken: new functionality is highly flexible, but using it requires programming for the BPF virtual machine.
Index entries for this article

Kernel: Networking/eXpress Data Path (XDP)
Accelerating networking with AF_XDP
Posted Apr 10, 2018 12:41 UTC (Tue) by sbates (subscriber, #106518)

"and, for some hardware, to offload the BPF program into the interface itself."

Does anyone have a link to which devices support this and any patches needed to enable it? Offloading arbitrary BPF programs to the NIC sounds very interesting given the path the kernel seems to be taking.

Hardware support
Posted Apr 10, 2018 12:52 UTC (Tue) by corbet (editor, #1)

Much of the available information seems to be in the form of slideware, but Netronome has a card that does BPF offload. See this netdev talk [PDF] or this FOSDEM talk [PDF].

Hardware support
Posted Apr 11, 2018 12:02 UTC (Wed) by sbates (subscriber, #106518)

https://qmonnet.github.io/whirl-offload/2016/09/01/dive-i...

And yes, from what I can see, only Netronome offer kernel support for HW offload of eBPF programs today (but I'd love to be corrected if this is not true).

Accelerating networking with AF_XDP
Posted Apr 11, 2018 20:59 UTC (Wed) by RamiRosen (guest, #37330)

AF_XDP was sent to the dpdk-dev mailing list about a month ago. See:

http://dpdk.org/ml/archives/dev/2018-February/091502.html

Rami Rosen

related writeup on getting packets to user-space
Posted Apr 19, 2018 8:23 UTC (Thu) by wingo (guest, #26929)

https://wingolog.org/archives/2018/02/05/notes-from-the-f...

I thought the following talk by François-Frédéric Ozog was an interesting counterpoint. In any case, if the kernel can get packets to userspace in a fast, generic way, that would definitely be a step forward.

Accelerating networking with AF_XDP
Posted May 5, 2020 14:38 UTC (Tue) by f18m (guest, #133856)

Thanks for this very interesting article.

One question though: is AF_XDP still hindered by interrupts? I mean, in high-performance applications the "ksoftirqd" thread will jump up to 100% of CPU usage in scenarios where you're receiving a lot of packets per second... at 100Gbps the theoretical maximum is 148 Mpps... unless you use the DPDK framework, which uses polling instead of interrupts, you will never achieve those packet rates on Linux. Is AF_XDP using a polling mechanism or does it rely on interrupts?

Thanks!