Better visibility into packet-dropping decisions

By Jonathan Corbet
February 25, 2022

Dropped packets are a fact of life in networking; there can be any number of reasons why a packet may not survive the journey to its destination. Indeed, there are so many ways that a packet can meet its demise that it can be hard for an administrator to tell why packets are being dropped. That, in turn, can make life difficult in times when users are complaining about high packet-loss rates. Starting with 5.17, the kernel is getting some improved instrumentation that should shed some light on why the kernel decides to route packets into the bit bucket.

This problem is not new, and neither are attempts to address it. The kernel currently contains a "drop_monitor" functionality that was introduced in the 2.6.30 kernel back in 2009. Over the years, it has gained some functionality but has managed to remain thoroughly and diligently undocumented. This feature appears to support a netlink API that can deliver notifications when packets are dropped. Those notifications include an address within the kernel showing where the decision to drop the packet was made, and can optionally include the dropped packets themselves. User-space code can turn the addresses into function names; desperate administrators can then dig through the kernel source to try to figure out what is going on.

It seems like there should be a better way. As it happens, the beginning of the infrastructure to provide that better way was contributed to 5.17 by Menglong Dong. The internal kernel function that frees the memory holding a packet is kfree_skb(); in 5.17, that function has become:

    void kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason);

The reason argument is new; it is intended to say why the packet passed as skb has reached the end of the line. This information is not actually useful to the kernel, but it has been added to the existing kfree_skb tracepoint, making it available to any program that connects to that tracepoint. Analysis scripts can quickly print out why packets are being dropped; administrators can also attach BPF programs to, for example, create a histogram of reasons for dropped packets.

A new version of kfree_skb() has also been added; it simply calls kfree_skb_reason() with "unspecified" as the reason.
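The relationship between the two functions can be sketched in user space. What follows is only a mock under stated assumptions, not kernel code: the enumerator names, the stand-in struct sk_buff, and the drop_histogram array are all hypothetical, and the mock counts reasons where the real function would fire the tracepoint and free the buffer.

```c
#include <stdio.h>

/*
 * User-space mock, not kernel code. The enumerator names are
 * illustrative assumptions modeled on the reasons listed in the text.
 */
enum skb_drop_reason {
    SKB_DROP_REASON_NOT_SPECIFIED,  /* the "unspecified" default */
    SKB_DROP_REASON_NO_SOCKET,      /* no matching socket */
    SKB_DROP_REASON_PKT_TOO_SMALL,  /* shorter than the protocol header */
    SKB_DROP_REASON_CSUM,           /* checksum failure */
    SKB_DROP_REASON_SOCKET_FILTER,  /* rejected by a socket filter */
    SKB_DROP_REASON_MAX
};

/* Hypothetical stand-in for the kernel's packet buffer structure. */
struct sk_buff {
    unsigned int len;
};

/* Per-reason counters, playing the role of a tracing-side histogram. */
unsigned long drop_histogram[SKB_DROP_REASON_MAX];

/* Mock: tally the reason; the caller keeps ownership of skb here. */
void kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
{
    if (!skb)
        return;
    if ((unsigned int)reason >= SKB_DROP_REASON_MAX)
        reason = SKB_DROP_REASON_NOT_SPECIFIED;
    drop_histogram[reason]++;
}

/* The old entry point becomes a thin wrapper with no stated reason. */
void kfree_skb(struct sk_buff *skb)
{
    kfree_skb_reason(skb, SKB_DROP_REASON_NOT_SPECIFIED);
}

/* Map a reason to a printable name for reporting. */
const char *drop_reason_name(enum skb_drop_reason reason)
{
    static const char *const names[SKB_DROP_REASON_MAX] = {
        "NOT_SPECIFIED", "NO_SOCKET", "PKT_TOO_SMALL",
        "CSUM", "SOCKET_FILTER",
    };

    if ((unsigned int)reason >= SKB_DROP_REASON_MAX)
        return "UNKNOWN";
    return names[reason];
}

/* Print one line per reason that has been seen at least once. */
void print_drop_histogram(FILE *out)
{
    int i;

    for (i = 0; i < SKB_DROP_REASON_MAX; i++)
        if (drop_histogram[i])
            fprintf(out, "%-14s %lu\n",
                    drop_reason_name((enum skb_drop_reason)i),
                    drop_histogram[i]);
}
```

In this mock, every call that goes through plain kfree_skb() is counted under NOT_SPECIFIED; only call sites converted to kfree_skb_reason() contribute a more useful entry to the histogram.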

In 5.17, the use of this infrastructure is relatively limited. There are a few TCP-level drop locations that have been instrumented with the new call, including code that drops packets for being smaller than the TCP header size, not being associated with an existing TCP socket, exhibiting checksum failures, or having been explicitly dropped by an add-on socket filter program. The UDP subsystem has also been enhanced to note those same reasons for dropped packets.

The situation is set to improve considerably in 5.18; patches already in linux-next add a number of new reasons. These document packets that have been dropped by the netfilter subsystem, that contain IP-header errors, or that have been identified as spoofed by the reverse-path filter (rp_filter) mechanism. Administrators will be able to see IP packets that have been dropped due to an unsupported higher-level protocol. Reasons have also been added for UDP packets dropped by the IPsec XFRM policy or a lack of memory within the kernel.

There is yet another set of reason annotations that has been accepted, but which has not yet found its way into linux-next; chances are that these will show up in 5.18 as well. They extend the XFRM-policy annotation to TCP, note packets dropped due to missing or incorrect MD5 hashes (which are evidently still a thing in 2022), as well as those containing invalid TCP flags or sequence numbers outside of the current TCP window. These patches also add new instances of the other reasons noted above; some situations can be detected in multiple places.

While the above set of reasons may seem long, this work could be seen as having just begun. In current linux-next, there are over 2,700 calls to kfree_skb(), compared to 18 to kfree_skb_reason(). That suggests that a lot of packets will still be dropped for unspecified reasons. Still, this work represents a useful step forward, one that should make many of the reasons for packet loss more readily available to system administrators.

The part that remains missing, of course, is the user-space side. The current reason codes are all defined in <linux/skbuff.h>, which is not part of the externally available kernel API. Moving them to a separate file under the uapi directory would make them more accessible to developers. Also helpful, of course, would be to have some documentation for this mechanism and how to use it (and interpret the results), but even your editor, often cited for naive optimism, will not be holding his breath for that to show up.

Meanwhile, though, an important piece of the kernel's network functionality is becoming a little more transparent to users. That should make life easier for system administrators who will be able to spend less time trying to figure out why packets aren't making it through the system. Unfortunately, though, this work offers no help for users who are wondering why their packets are disappearing somewhere in the far reaches of the Internet.

Better visibility into packet-dropping decisions

Posted Feb 25, 2022 20:29 UTC (Fri) by atnot (subscriber, #124910) [Link] (8 responses)

Has this been considered for other things too? I regularly find myself wishing something like this existed for figuring out which of the many mechanisms an EPERM/EACCES was caused by (unix permissions, acl, selinux and other LSMs, file systems, dm layers, cgroups, namespaces, seccomp, capabilities, API misuse, ...)

Better visibility into packet-dropping decisions

Posted Feb 26, 2022 2:04 UTC (Sat) by shemminger (subscriber, #5739) [Link] (2 responses)

Netlink was enhanced to provide error messages (not just errno).
Many places have it, but lots still need work -- volunteers wanted.

Better visibility into packet-dropping decisions

Posted Feb 26, 2022 5:52 UTC (Sat) by tititou (subscriber, #75162) [Link] (1 responses)

Hi,
Can you provide a link or an example of it?

Better visibility into packet-dropping decisions

Posted Feb 26, 2022 19:03 UTC (Sat) by johill (subscriber, #25196) [Link]

Check out commit 2d4bc93368f5a ("netlink: extended ACK reporting") which added the bare minimum infrastructure a long time ago, and you can find many users of NL_SET_ERR_MSG/GENL_SET_ERR_MSG (and similar macros) these days.

It supports reporting a string (error message), a pointer to a bad attribute, and if NL_SET_ERR_MSG_ATTR_POL was used (which it is in the general policy-based parsing) will even return the policy for the attribute back to userspace to explain why the attribute failed (e.g. if it's NLA_RANGE(U32, 1,2) and you gave a value 3).

return -Exxxxx;

Posted Feb 26, 2022 15:20 UTC (Sat) by jreiser (subscriber, #11027) [Link] (4 responses)

There is a need for a facility to locate at run time every failed subroutine call. The source code can be edited with sed so that return -Exxxxx; becomes return ErrorCode(Exxxxx); with a default macro definition something like

     #ifndef ErrorCode
     #define ErrorCode(errnum) -(errnum)
     #endif

Then the determined investigator can re-compile selected source files with something like

     #define ErrorCode(errnum) myErrorDiagnostic(errnum, __builtin_return_address(0), __FUNCTION__, __LINE__)

and supply a definition for the added subroutine myErrorDiagnostic. Of course there are a handful of cases where error numbers are variables or the syntax is complex, and also a few places where simple automated editing fails. Rate limiting the reporting can be a problem. But I did this once, and got the answer I wanted.
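A minimal user-space sketch of that technique follows, assuming GCC or Clang (for __builtin_return_address()). The body of myErrorDiagnostic, the error_reports counter, and the check_length() function are hypothetical illustrations, not code from the comment above.

```c
#include <errno.h>
#include <stdio.h>

/* Number of diagnostics emitted so far (hypothetical bookkeeping). */
int error_reports;

/*
 * Report where an error code was generated, then hand back the usual
 * negative errno value so callers see no change in behavior.
 */
int myErrorDiagnostic(int errnum, void *caller, const char *func, int line)
{
    error_reports++;
    fprintf(stderr, "error %d raised in %s() at line %d, caller %p\n",
            errnum, func, line, caller);
    return -errnum;
}

/* The investigator's override; it must be defined before the default. */
#define ErrorCode(errnum) \
    myErrorDiagnostic((errnum), __builtin_return_address(0), \
                      __func__, __LINE__)

/* The default definition, skipped here because of the override above. */
#ifndef ErrorCode
#define ErrorCode(errnum) -(errnum)
#endif

/* Hypothetical function after the sed-style edit of its return. */
int check_length(int len)
{
    if (len < 0)
        return ErrorCode(EINVAL);   /* was: return -EINVAL; */
    return 0;
}
```

With the override in place, check_length(-1) still returns -EINVAL but also logs the function name and line where the error originated; removing the override restores the original behavior with no run-time cost.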

return -Exxxxx;

Posted Feb 26, 2022 19:05 UTC (Sat) by johill (subscriber, #25196) [Link] (2 responses)

In most files you can even just

#define EINVAL ({printk(...); 22;})

if you really want :-)

return -Exxxxx;

Posted Feb 27, 2022 3:21 UTC (Sun) by roc (subscriber, #30627) [Link] (1 responses)

That would surely fail to build with EINVAL being used in a case label.

return -Exxxxx;

Posted Feb 27, 2022 9:17 UTC (Sun) by jengelh (subscriber, #33263) [Link]

Good thing the main kernel has just two `case EINVAL` across its ~30 million lines.

return -Exxxxx;

Posted Mar 11, 2022 8:44 UTC (Fri) by njs (guest, #40338) [Link]

Someone actually implemented this and released the patches so you can too:

https://github.com/nviennot/linux-trace-error

Better visibility into packet-dropping decisions

Posted Feb 26, 2022 4:49 UTC (Sat) by alison (subscriber, #63752) [Link] (1 responses)

Assuredly knowing when packets are dropped because NAPI polling isn't keeping up with what's incoming would be valuable. Yeah, I'm sure that patches and test data would be welcome.

Better visibility into packet-dropping decisions

Posted Feb 27, 2022 21:26 UTC (Sun) by shemminger (subscriber, #5739) [Link]

In order to see packets being dropped because the CPU can't keep up, you have to look at the hardware statistics.
This is reported in rx_missed. Not sure if there is more that the HW can tell you.
There are lots of rx_dropped places in drivers; these could/should be instrumented.

Better visibility into packet-dropping decisions

Posted Feb 27, 2022 23:43 UTC (Sun) by amarao (subscriber, #87073) [Link] (3 responses)

MD5 for TCP is really the single good protection against RST attacks on BGP. You can filter ingress, but there is always a risk of missing something. Having MD5 allows month-long TCP sessions without the risk of a malicious RST.

Better visibility into packet-dropping decisions

Posted Mar 2, 2022 3:25 UTC (Wed) by MaZe (subscriber, #53908) [Link] (2 responses)

eh, most uses of tcp md5 are pretty pointless because they just use well known passwords...

Better visibility into packet-dropping decisions

Posted Mar 2, 2022 9:58 UTC (Wed) by amarao (subscriber, #87073) [Link]

I do understand you. When a new session is agreed with a party, a password is provided together with IP and AS number. Even though MD5 is considered hopelessly broken, for the sake of RST protection it is more than enough, because even 32 additional bits push the attack from the `feasible` to the `infeasible` realm.

Better visibility into packet-dropping decisions

Posted Jul 7, 2022 6:48 UTC (Thu) by gdt (subscriber, #6284) [Link]

Even using a silly MD5 password is worthwhile, since the spray of failed MD5 packets (and thus log messages) prior to the BGP connection reset makes it plain that the cause is network abuse.
Cynically, if the BGP connection isn't using a long, random, unique key prior to that outage, then it will be afterwards :-)

Linux counting failed MD5 packets is excellent, as network operators investigating BGP connection issues can check that the counter is the expected zero.

For the longest time vendors were promoting IPsec as the replacement for the TCP MD5 option, but operationally the overhead of configuration and customer education was too high. More recently TCP-AO (Authentication Option) offers a similar mechanism to the MD5 option, but with modern cryptographic algorithms.

For external BGP connections the TTL security check also offers good protection from network abuse. Customers generally seem to be able to configure that without much difficulty.

Better visibility into packet-dropping decisions

Posted Mar 6, 2022 20:16 UTC (Sun) by gfa (guest, #53331) [Link] (1 responses)

> The kernel currently contains a "drop_monitor" functionality that was introduced in the 2.6.30 kernel back in 2009

Does anybody know any tool that can use this functionality?

thanks

Better visibility into packet-dropping decisions

Posted Mar 9, 2022 17:55 UTC (Wed) by rstonehouse (subscriber, #81531) [Link]

See https://github.com/idosch/mlxsw-1/wiki/Packet-Drops-Monit... which talks about using https://github.com/nhorman/dropwatch

(Also there is a systemtap script to do something similar. See https://sourceware.org/git/?p=systemtap.git;a=blob;f=test...)