TCP small queues and WiFi aggregation — a war story
This article describes our findings connecting TCP small queues (TSQ) with the behavior of advanced WiFi protocols and, in the process, solving a throughput regression. The resulting patch is already in the mainline tree, so before trying to reproduce our results, make sure your kernel is up to date. Beyond the fix itself, it is a delightful trip through the history of how we discovered the problem, how it was tackled, and how it was patched.
The academic life is full of rewards; one of ours was the moment in which three USB/WiFi 802.11a/b/g/n dongles arrived. We bought dongles with an Atheros chipset because both the software driver and the firmware are available and modifiable. We were using the ath9k_htc kernel module with the default configuration. We compiled the latest (at the time) available kernel (4.13.8), and then we started the access point to create an 802.11n network to build the core of our future testbed for vehicular communications.
We started some simple tests with ping and iperf to check the connectivity, the distribution of IP addresses, and our custom DNS, which translates the names of our services into IP addresses. The nominal transfer rate of the dongles is 150Mb/s, but what we saw on the screen was disappointing: an upload iperf connection, no matter which options were used, was able to reach only 40Mb/s. Using another operating system as a client, we were able to achieve 90Mb/s, ruling out a problem with the server. Even with the newer kernel release (4.14), we did not see anything in the kernel messages that could be correlated with a hardware or driver failure. To stress-test the equipment, we started a UDP transmission at a ludicrous speed. Not so surprisingly, we got to almost 100Mb/s. It was clear that the root of the problem was in the TCP module or its interactions with the queueing disciplines, so the journey began.
The next step involved the tc command. We started by listing the default queueing discipline and modifying its parameters. By default, we were using the mq queueing discipline, which instantiates an FQ-CoDel queueing discipline for each outgoing hardware queue. With some other drivers, such as ath9k, the entire queueing layer is bypassed and a custom version of it, without the possibility of tuning or modifying the queueing discipline, is implemented inside the kernel WiFi subsystem. With the ath9k_htc driver, instead, we still had the chance to play with the queueing discipline type and parameters. We opted for the most basic (but reliable) discipline, pfifo_fast. But nothing changed.
We were using the default CUBIC congestion-control module. Despite the recent hype around BBR, we decided to stick with CUBIC because it has always just worked and never betrayed us (until now, it seems). Just to see, we switched to BBR, but things got worse: the throughput dropped by 50%, never passing the 20Mb/s line. For all of the tests we employed Flent, which also provides latency results. All the latencies were low; we never exceeded a couple of milliseconds of delay. In our experience, low throughput with low latency indicates a well-known problem: starvation. So the question became: what was limiting the number of segments transmitted by the client?
In 2012, with commit 46d3ceabd8d9, TCP small queues were introduced. Their initial objective was to prevent TCP sockets from queuing more than 128KB of data in the network stack. In 2013, the algorithm was updated to use a dynamic limit: instead of the fixed value, the limit was defined as either two segments' worth of data or an amount of data corresponding to a transmission time of 1ms at the current (estimated) transmission rate, whichever is larger. The calculation of the transmission rate had been added some months earlier, with the objective of properly sizing segments when TCP segmentation offload is in use, along with the introduction of a packet scheduler (FQ) able to spread the sent segments over an interval.
However, the first reports suggested that the amount of data queued was too low for some subsystems, such as WiFi. The reason was that the WiFi driver could not perform frame aggregation because there was not enough data in its queue. The aggregation technique combines multiple packets into a single frame to reduce the fixed per-transmission overhead of sending them over the air. Preventing aggregation is a sure way to wreck throughput.
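To see why starving the aggregation logic hurts so much, consider a back-of-the-envelope model: every transmission opportunity pays a roughly fixed cost for channel contention, the PHY preamble, and the block acknowledgment, no matter how many subframes ride in it. The sketch below uses assumed round numbers (200us of fixed overhead, 100us of airtime per 1500-byte frame), not measurements from our testbed:

    /* Back-of-the-envelope model of aggregation efficiency. The per-TXOP
     * overhead and per-frame airtime are assumed round numbers chosen
     * only to illustrate the trend, not measured values.
     */
    #include <stdio.h>

    int main(void)
    {
        const double overhead_us = 200.0;  /* assumed fixed cost per transmission */
        const double frame_us = 100.0;     /* assumed airtime per 1500-byte frame */
        const double frame_bits = 1500.0 * 8.0;

        for (int n = 1; n <= 32; n *= 2) {
            double airtime_us = overhead_us + n * frame_us;
            double mbps = n * frame_bits / airtime_us;  /* bits per us == Mb/s */

            printf("%2d frame(s) per aggregate -> ~%3.0f Mb/s effective\n", n, mbps);
        }
        return 0;
    }

With these made-up numbers, one frame per transmission tops out at a fraction of what the same link achieves with a few dozen subframes per aggregate; the exact figures differ per chipset, but the shape of the curve is why an under-filled queue translates directly into lost throughput.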
In response, a minimum amount of buffering (128KB) was restored in commit 98e09386c0ef4. One year later, a refactoring patch for segmentation offload sizing introduced a small modification that, as we will see, changed the situation dramatically. The 128KB value was changed from being a lower bound to an upper bound. If the amount of data queued was forced to be less than 128KB, what would happen to the WiFi aggregation?
Fast-forwarding to the 4.14 kernel, we started to think about how to tune these thresholds. First of all, the function that decides (even in recent kernels) how much TCP data is allowed to enter the network stack is tcp_small_queue_check():
    static bool tcp_small_queue_check(struct sock *sk, const struct sk_buff *skb,
                                      unsigned int factor)
    {
        unsigned int limit;

        limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
        limit = min_t(u32, limit,
                      sock_net(sk)->ipv4.sysctl_tcp_limit_output_bytes);
        limit <<= factor;
        /* ... */
    }
The limit is calculated as the maximum of two full-sized segments' worth of data and ~1ms of data at the current rate. Then the minimum of this value and the 128KB threshold is used (to be in sync with kernel history, we should note that the default value was raised to 256KB in 2015). We started to wonder what would happen if we brought back the possibility of setting a lower bound on the amount of data that could be enqueued. We then modified the function above in the most obvious way and obtained the following results:
The first column represents the results using the unmodified TSQ parameters (two segments or ~1ms of data at the current rate). In the second, we forced at least 64KB to be queued. As we can see, the throughput increased by 20Mb/s, but so did the delay (even if the latency increase is not as pronounced as the throughput increase). Then we tested the original configuration, in which the lower bound was 128KB; the throughput exceeded 90Mb/s, with an added latency of 2ms. It is enough to have 128KB of data queued to get the proper aggregation behavior, at least with our hardware. Increasing that value further (we plotted up to 10MB) does not improve the throughput in any way, but it worsens the delay. Even disabling TSQ entirely did not improve the situation. We had found the cause of the problem: a minimum amount of data must be queued to ensure that frame aggregation works.
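To put numbers on how small the default budget is at WiFi speeds, here is a quick user-space rework of the limit arithmetic from tcp_small_queue_check() shown above; the pacing rates and the 2KB-per-skb truesize are illustrative assumptions, not values taken from our testbed:

    /* Rough user-space rework of the TSQ limit: ~1ms of data at the pacing
     * rate, floored at two full-sized skbs and capped by the equivalent of
     * tcp_limit_output_bytes. Rates and truesize are assumptions.
     */
    #include <stdio.h>

    int main(void)
    {
        const unsigned long two_skbs = 2 * 2048;   /* two full-sized skbs (truesize) */
        const unsigned long cap = 256 * 1024;      /* default tcp_limit_output_bytes */
        const unsigned long rates_bps[] = { 10000000UL, 40000000UL, 100000000UL };

        for (int i = 0; i < 3; i++) {
            unsigned long pacing_rate = rates_bps[i] / 8;  /* bytes per second   */
            unsigned long limit = pacing_rate >> 10;       /* ~1ms worth of data */

            if (limit < two_skbs)
                limit = two_skbs;
            if (limit > cap)
                limit = cap;
            printf("%3lu Mb/s -> TSQ allows ~%lu KB in the stack\n",
                   rates_bps[i] / 1000000, limit / 1024);
        }
        return 0;
    }

Even at the dongles' nominal 150Mb/s, the ~1ms budget stays under 20KB, an order of magnitude below the 128KB that, as the tests above show, our hardware needs in order to aggregate properly.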
After the testing phase, we realized that putting back fixed byte values would be the wrong choice because, for slow flows, we would only have increased the latency. But, thanks to the modifications done to support BBR, we do know the flow's current rate: why not use it? In fact, in commit 3a9b76fd0db9f, merged at the end of 2017, the TSQ logic was extended to allow a device driver to increase the number of milliseconds' worth of data that can be queued. The best value for throughput across all the hardware we tested was between 4ms and 8ms of data at the flow rate. So we shared our results, and some weeks later a patch was accepted. In your latest kernel, thanks to commit 36148c2bbfbe, your WiFi driver can allow TCP to queue enough data to solve the aggregation problem with a negligible impact on latency.
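Concretely, commit 3a9b76fd0db9f added a per-socket sk_pacing_shift field (default 10, roughly 1ms of data) along with the sk_pacing_shift_update() helper, so a subsystem that knows it needs deeper queues can lower the shift and get a proportionally larger TSQ budget. The following is only a sketch of the idea under those assumptions, with a hypothetical function name, not the verbatim mac80211 change:

    /* Sketch only: a transmit-path hook that asks TCP to keep roughly 4ms
     * of data queued (pacing shift 8) instead of the default ~1ms (shift 10),
     * so the driver sees enough frames to build full aggregates. The function
     * name is hypothetical; see commit 36148c2bbfbe for the real change.
     */
    #include <net/sock.h>
    #include <linux/skbuff.h>

    static void example_wifi_adjust_tsq_budget(struct sk_buff *skb)
    {
        /* TSQ allows about sk_pacing_rate >> sk_pacing_shift bytes in the
         * stack: shift 10 is ~1ms, 9 is ~2ms, 8 is ~4ms, 7 is ~8ms.
         */
        if (skb->sk && sk_fullsock(skb->sk))
            sk_pacing_shift_update(skb->sk, 8);
    }

Because the budget is still expressed in time at the flow's own rate, a slow flow only ever queues a few milliseconds' worth of data, which is why the latency cost of the deeper queue stays negligible.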
The networking stack is complicated (what is simple in kernel space?). It is certainly not an opaque black box, but rather an orchestrated set of different pieces of knowledge, reflected in layers that can sometimes make incompatible choices. As a lesson, we learned that the relationship between latency and throughput is not the same across different technologies, and that aggregation in wireless technologies is more common than we initially thought. Moreover, as a community, we should start thinking about automated tests that can give an idea of the performance impact of a patch across different technologies and in a wide range of contexts, from the 40Gb/s device in a burdened server to the 802.11a/b/g/n USB dongle connected to a Raspberry Pi.
[The authors would like to thank Toke Høiland-Jørgensen for his support and for the time he dedicated to the Flent tool, to the WiFi drivers, and to gathering the results from the ath9k and ath10k drivers.]
Index entries for this article
Kernel: Networking/Wireless
GuestArticles: Augusto Grazia, Carlo
Posted Jun 19, 2018 3:35 UTC (Tue)
by pabs (subscriber, #43278)
[Link] (7 responses)
Posted Jun 19, 2018 6:56 UTC (Tue)
by Beolach (guest, #77384)
[Link] (5 responses)
Sadly, to the best of my knowledge there are no 802.11ac chipsets w/ both open-source drivers & firmware - I believe even the ath10k 802.11ac chipsets have closed firmware blobs. :-( I would love to hear about any I've missed.
Posted Jun 21, 2018 15:59 UTC (Thu)
by mtaht (subscriber, #11087)
[Link] (4 responses)
The host cpu chip is multicore and much faster than anything built into the wifi chip, but wifi has some really hard real-time constraints that require dedicated cpus onboard, and once you start doing that the temptation to wedge all your functionality there has thus far been overwhelming for vendors.
So far as I know, Qualcomm's 802.11ac devices use a proprietary real-time OS inside.
Some chipsets (like quantenna's) actually wedge an entire linux stack into their chip.
Qualcomm's LTE modems do also.
Posted Jun 21, 2018 16:43 UTC (Thu)
by excors (subscriber, #95769)
[Link] (1 responses)
And portability - you don't want to maintain multiple separate copies of your million lines of driver code (plus regression tests etc) for Linux, Windows, Apple, Fuchsia, the several obsolete Linux versions your customers still use, etc, when it's much easier to put almost all the code in firmware so the OS-specific driver is just a thin wrapper. Plus the Linux community will likely be much happier if you upstream that wrapper and leave the firmware opaque, than if you attempt to upstream your million lines of cross-platform-ish code that doesn't follow the Linux coding style and has an ugly abstraction layer for OS-specific bits.
None of that stops you making the thick firmware open source, though.
Posted Jun 21, 2018 19:05 UTC (Thu)
by mtaht (subscriber, #11087)
[Link]
If I had any one wish for "smart firmware", it would be only that there was enough smarts in the wifi and lte hardware/firmware to handle no more than 4ms worth of queuing and real time processing, and let the kernel handle the rest within its constraints for interrupt latency.
BQL accomplished this for ethernet (sub-1ms there, actually)
fq_codel for wifi ( https://www.usenix.org/system/files/conference/atc17/atc1... ) gets it down to two aggregates, which can take up to ~5ms each, at any achieved "line" rate.
We can do better than this with better control of txops on the AP, and certainly the algorithms above can exist, offloaded, in smarter hardware, which is happening on several chipsets I'm aware of.
PS The new sch_cake actually can run ethernet, shaped to 1gbit, at lower latencies than anything that uses BQL - at a large cost in cpu overhead, but not as much as you might think - it works at that rate on a quad core atom, for example.
http://www.taht.net/~d/cake/rrul_be_-_cake-shaped-gbit-qu...
vs sch_fq:
http://www.taht.net/~d/cake/rrul_be_-_fq-quad-long-smooth...
The principal advantage of cake here, even though it is now capable of running at speeds and latencies like this, is to defeat other black-box token-bucket shapers on a link at much lower rates, with a corresponding reduction in cpu cost to (at sub-100mbit) the level of "noise".
Posted Jun 22, 2018 9:52 UTC (Fri)
by kronat (guest, #117266)
[Link] (1 responses)
> Qualcomm's LTE modems do also.

Do you have a reference for these statements? I am investigating a similar problem in 3GPP networks. Thanks!
Posted Jun 22, 2018 23:40 UTC (Fri)
by mtaht (subscriber, #11087)
[Link]
But: https://osmocom.org/projects/quectel-modems/wiki
and the slides from that talk: https://fahrplan.events.ccc.de/congress/2016/Fahrplan/sys...
Posted Jun 22, 2018 8:12 UTC (Fri)
by cagrazia (guest, #124754)
[Link]
The exact chipsets we used in our tests are the Atheros AR9271 (ath9k_htc driver), the Atheros AR9580 (ath9k), and the Atheros QCA9880v2 (ath10k). Due to space constraints, we presented here only the results from the ath9k_htc device, but we had a similarly positive outcome with all three chipsets.
Posted Jun 19, 2018 8:04 UTC (Tue)
by shiftee (subscriber, #110711)
[Link]
Dongles based on this chipset are available:
https://www.thinkpenguin.com/gnu-linux/penguin-wireless-n...
and
https://www.olimex.com/Products/USB-Modules/MOD-WIFI-AR92...

If I remember correctly, there was a FOSS enthusiast working for Atheros who convinced them to release the firmware code, but he has since left the company.
Posted Jun 19, 2018 13:29 UTC (Tue)
by johan (guest, #112044)
[Link]

If they had they would likely have found issues like these a bit earlier.
Obviously it's a very hard task though, considering how many devices there are to test.
Posted Jun 19, 2018 17:50 UTC (Tue)
by josh (subscriber, #17465)
[Link] (1 responses)
I'd be curious how fast UDP on the other operating system went, to know if it topped out at the same 100Mb/s.
Posted Jun 22, 2018 8:27 UTC (Fri)
by cagrazia (guest, #124754)
[Link]
We did a very quick TCP and UDP test on a Windows 8 machine, with a kind of iperf tool (honestly, it was tricky and messy to configure): throughput results oscillated between 90 and 95 Mb/s, while the TCP RTT was very unstable, as was the UDP jitter.
Posted Jun 21, 2018 14:47 UTC (Thu)
by mtaht (subscriber, #11087)
[Link]
From eric dumazet:

On 06/21/2018 02:22 AM, Toke Høiland-Jørgensen wrote:
> Dave Taht <dave.taht@gmail.com> writes:
>
>> Nice war story. I'm glad this last problem with the fq_codel wifi code
>> is solved
>
> This wasn't specific to the fq_codel wifi code, but hit all WiFi devices
> that were running TCP on the local stack. Which would be mostly laptops,
> I guess...

Yes.

Also switching TCP stack to always GSO has been a major gain for wifi in my tests.
(TSQ budget is based on sk_wmem_alloc, tracking truesize of skbs, and not having
GSO is considerably inflating the truesize/payload ratio)

tcp: switch to GSO being always on
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/...

I expect SACK compression to also give a nice boost to wifi.

tcp: add SACK compression
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/...

Lastly I am working on adding ACK compression in TCP stack itself.
Posted Jun 23, 2018 18:16 UTC (Sat)
by meuh (guest, #22042)
[Link] (5 responses)
Posted Jun 25, 2018 9:16 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (4 responses)
Nope; the answer is IEEE 802.3 Ethernet. WiFi (IEEE 802.11) is designed to transparently interoperate with 802.3 Ethernets. The IEEE has declared that the Ethernet MTU is fixed at 1500 bytes[1]; this implies that WiFi per-frame MTUs are also fixed at 1500 bytes. Given that it is a hard requirement for WiFi that the frame MTU is no more than 1500 bytes, you need things like aggregation to get a decent speed.
If larger frames were permitted on 802.11, then you would not be able to bridge 802.11 with IEEE standard 802.3; while it's common to support jumbo frames on Ethernet, this is technically a non-standard extension, and IEEE standard 802.11 can't assume that any Ethernet it is connected to will permit jumbo frames.
[1] While the IEEE 802.3 MTU is 1500 bytes, they also now require all equipment to handle frames of up to 2000 bytes in total size, to allow for headers, checksums, VLAN tags etc. WiFi is similar - 2304 byte maximum MSDU frame, of which 1500 bytes maximum is user MTU, and the other 804 bytes are reserved for VLAN tags etc.
Posted Jun 25, 2018 16:01 UTC (Mon)
by raven667 (subscriber, #5198)
[Link] (3 responses)
Posted Jun 25, 2018 16:20 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (2 responses)
Nope. The issue is that you cannot have an MTU above 1500 on Ethernet without breaking the IEEE specs for Ethernet and for WiFi. You are simply not allowed a jumbo MTU on the Layer 2 link, and the IEEE won't accept changes to 802 series standards that increase the user MTU beyond 1500.
IPv6 is not relevant here - it's an IEEE decision because even in IPv4, with router fragmentation allowed, the IEEE doesn't like it.
Posted Jun 25, 2018 17:28 UTC (Mon)
by raven667 (subscriber, #5198)
[Link] (1 responses)
Posted Jun 25, 2018 17:40 UTC (Mon)
by farnz (subscriber, #17727)
[Link]
You do, but they're not using IEEE standard Ethernet (jumbo frames implies not IEEE standard) - and WiFi standards (including frame aggregation) are written to get high performance when using IEEE standard Ethernet.
Hence frame aggregation rather than high MTUs - a high MTU for performance means being outside the IEEE standard, while a 1500 MTU allows you to be inside the standard.