TCP small queues and WiFi aggregation — a war story
This article describes our findings connecting TCP small queues (TSQ) with the behavior of advanced WiFi protocols and, in the process, solving a throughput regression. The resulting patch is already in the mainline tree, so before trying to reproduce our results, make sure your kernel is up to date. Beyond the fix itself, it is a delightful trip through the history of how we discovered the problem, how it was tackled, and how it was patched.
The academic life is full of rewards; one of ours was the moment in which three USB/WiFi 802.11a/b/g/n dongles arrived. We bought dongles with an Atheros chipset because both the software driver and the firmware are available and modifiable. We were using the ath9k_htc kernel module with the default configuration. We compiled the latest (at the time) available kernel (4.13.8), and then we started the access point to create an 802.11n network to build the core of our future testbed for vehicular communications.
We started some simple tests with ping and iperf to check the connectivity, the distribution of IP addresses, and our custom DNS, which translates the names of our services into IP addresses. The nominal transfer rate of the dongles is 150Mb/s, but what we saw on the screen was disappointing: an upload iperf connection, no matter which options were used, was able to reach only 40Mb/s. Using another operating system as a client, we were able to achieve 90Mb/s, ruling out a problem with the server. Even with the newer kernel release (4.14), we did not see anything in the kernel messages that could be correlated with a hardware or driver failure. To stress-test the equipment, we started a UDP transmission at a ludicrous speed. Not so surprisingly, we got to almost 100Mb/s. It was clear that the root of the problem was in the TCP module or its interactions with the queueing disciplines, so the journey began.
The next step involved the tc command. We started by listing the default queueing discipline and modifying its parameters. By default, we were using the mq queueing discipline, which instantiates an FQ-CoDel queueing discipline for each outgoing hardware queue. With some other drivers, such as ath9k, the entire queueing layer is bypassed and a custom version of it, without the possibility of tuning or modifying the queueing discipline, is implemented inside the kernel WiFi subsystem. With the ath9k_htc driver, instead, we still had the chance to play with the queueing discipline type and parameters. We opted for the most basic (but reliable) discipline, pfifo_fast. But nothing changed.
We were using the default CUBIC congestion-control module. Despite the recent hype around BBR, we decided to stick with CUBIC because it has always just worked and never betrayed us (until now, it seems). Just to see, we switched to BBR, but things got worse: the throughput dropped by 50%, never passing the 20Mb/s line. For all of the tests we employed Flent, which also provides latency results. All the latencies were low; we never exceeded a couple of milliseconds of delay. In our experience, low throughput with low latency indicates a well-known problem: starvation. So the question became: what was limiting the number of segments transmitted by the client?
In 2012, with commit 46d3ceabd8d9, TCP small queues were introduced. Their initial objective was to prevent TCP sockets from queuing more than 128KB of data in the network stack. In 2013, the algorithm was updated to use a dynamic limit: instead of the fixed value, the limit was defined as either two segments' worth of data or an amount of data corresponding to a transmission time of 1ms at the current (estimated) transmission rate, whichever is larger. The calculation of the transmission rate had been added some months earlier, with the objective of properly sizing segments when TCP segmentation offload is in use, along with the introduction of a packet scheduler (FQ) able to spread the sent segments over an interval.
However, the first reports suggested that the amount of data queued was too low for some subsystems, such as WiFi. The reason was that the WiFi driver could not perform frame aggregation because there was not enough data in its queue. The aggregation technique combines multiple packets into a single frame to reduce the fixed per-transmission overhead of sending them over the air. Preventing aggregation is a sure way to wreck throughput.
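To see why starving the aggregation logic hurts so much, consider a back-of-the-envelope model: every transmission opportunity pays a roughly fixed cost for channel contention, the PHY preamble, and the block acknowledgment, no matter how many subframes ride in it. The sketch below uses assumed round numbers (200us of fixed overhead, 100us of airtime per 1500-byte frame), not measurements from our testbed:

    /* Back-of-the-envelope model of aggregation efficiency. The per-TXOP
     * overhead and per-frame airtime are assumed round numbers chosen
     * only to illustrate the trend, not measured values.
     */
    #include <stdio.h>

    int main(void)
    {
        const double overhead_us = 200.0;  /* assumed fixed cost per transmission */
        const double frame_us = 100.0;     /* assumed airtime per 1500-byte frame */
        const double frame_bits = 1500.0 * 8.0;

        for (int n = 1; n <= 32; n *= 2) {
            double airtime_us = overhead_us + n * frame_us;
            double mbps = n * frame_bits / airtime_us;  /* bits per us == Mb/s */

            printf("%2d frame(s) per aggregate -> ~%3.0f Mb/s effective\n", n, mbps);
        }
        return 0;
    }

With these made-up numbers, one frame per transmission tops out at a fraction of what the same link achieves with a few dozen subframes per aggregate; the exact figures differ per chipset, but the shape of the curve is why an under-filled queue translates directly into lost throughput.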
In response, a minimum amount of buffering (128KB) was restored in commit 98e09386c0ef4. One year later, a refactoring patch for segmentation offload sizing introduced a small modification that, as we will see, changed the situation dramatically. The 128KB value was changed from being a lower bound to an upper bound. If the amount of data queued was forced to be less than 128KB, what would happen to the WiFi aggregation?
Fast-forwarding to the 4.14 kernel, we started to think about how to tune these thresholds. First of all, the function that decides (even in recent kernels) how much TCP data is allowed to enter the network stack is tcp_small_queue_check():
    static bool tcp_small_queue_check(struct sock *sk, const struct sk_buff *skb,
                                      unsigned int factor)
    {
        unsigned int limit;

        limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
        limit = min_t(u32, limit,
                      sock_net(sk)->ipv4.sysctl_tcp_limit_output_bytes);
        limit <<= factor;
        /* ... */
    }
The limit is calculated as the maximum of two full-sized segments' worth of data and ~1ms of data at the current rate. Then the minimum of this value and the 128KB threshold is used (to be in sync with kernel history, we should note that the default value was raised to 256KB in 2015). We started to wonder what would happen if we brought back the possibility of setting a lower bound on the amount of data that could be enqueued. We then modified the function above in the most obvious way and obtained the following results:
The first column represents the results using the unmodified TSQ parameters (two segments or ~1ms of data at the current rate). In the second, we forced at least 64KB to be queued. As we can see, the throughput increased by 20Mb/s, but so did the delay (even if the latency increase is not as pronounced as the throughput increase). Then we tested the original configuration, in which the lower bound was 128KB; the throughput exceeded 90Mb/s, with an added latency of 2ms. It is enough to have 128KB of data queued to get the proper aggregation behavior, at least with our hardware. Increasing that value further (we plotted up to 10MB) does not improve the throughput in any way, but it worsens the delay. Even disabling TSQ entirely did not improve the situation. We had found the cause of the problem: a minimum amount of data must be queued to ensure that frame aggregation works.
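To put numbers on how small the default budget is at WiFi speeds, here is a quick user-space rework of the limit arithmetic from tcp_small_queue_check() shown above; the pacing rates and the 2KB-per-skb truesize are illustrative assumptions, not values taken from our testbed:

    /* Rough user-space rework of the TSQ limit: ~1ms of data at the pacing
     * rate, floored at two full-sized skbs and capped by the equivalent of
     * tcp_limit_output_bytes. Rates and truesize are assumptions.
     */
    #include <stdio.h>

    int main(void)
    {
        const unsigned long two_skbs = 2 * 2048;   /* two full-sized skbs (truesize) */
        const unsigned long cap = 256 * 1024;      /* default tcp_limit_output_bytes */
        const unsigned long rates_bps[] = { 10000000UL, 40000000UL, 100000000UL };

        for (int i = 0; i < 3; i++) {
            unsigned long pacing_rate = rates_bps[i] / 8;  /* bytes per second   */
            unsigned long limit = pacing_rate >> 10;       /* ~1ms worth of data */

            if (limit < two_skbs)
                limit = two_skbs;
            if (limit > cap)
                limit = cap;
            printf("%3lu Mb/s -> TSQ allows ~%lu KB in the stack\n",
                   rates_bps[i] / 1000000, limit / 1024);
        }
        return 0;
    }

Even at the dongles' nominal 150Mb/s, the ~1ms budget stays under 20KB, an order of magnitude below the 128KB that, as the tests above show, our hardware needs in order to aggregate properly.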
After the testing phase, we realized that putting back fixed byte values would be the wrong choice because, for slow flows, we would only have increased the latency. But, thanks to the modifications done to support BBR, we do know the flow's current rate: why not use it? In fact, in commit 3a9b76fd0db9f, merged at the end of 2017, the TSQ logic was extended to allow a device driver to increase the number of milliseconds' worth of data that can be queued. The best value for throughput across all the hardware we tested was between 4ms and 8ms of data at the flow rate. So we shared our results, and some weeks later a patch was accepted. In your latest kernel, thanks to commit 36148c2bbfbe, your WiFi driver can allow TCP to queue enough data to solve the aggregation problem with a negligible impact on latency.
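Concretely, commit 3a9b76fd0db9f added a per-socket sk_pacing_shift field (default 10, roughly 1ms of data) along with the sk_pacing_shift_update() helper, so a subsystem that knows it needs deeper queues can lower the shift and get a proportionally larger TSQ budget. The following is only a sketch of the idea under those assumptions, with a hypothetical function name, not the verbatim mac80211 change:

    /* Sketch only: a transmit-path hook that asks TCP to keep roughly 4ms
     * of data queued (pacing shift 8) instead of the default ~1ms (shift 10),
     * so the driver sees enough frames to build full aggregates. The function
     * name is hypothetical; see commit 36148c2bbfbe for the real change.
     */
    #include <net/sock.h>
    #include <linux/skbuff.h>

    static void example_wifi_adjust_tsq_budget(struct sk_buff *skb)
    {
        /* TSQ allows about sk_pacing_rate >> sk_pacing_shift bytes in the
         * stack: shift 10 is ~1ms, 9 is ~2ms, 8 is ~4ms, 7 is ~8ms.
         */
        if (skb->sk && sk_fullsock(skb->sk))
            sk_pacing_shift_update(skb->sk, 8);
    }

Because the budget is still expressed in time at the flow's own rate, a slow flow only ever queues a few milliseconds' worth of data, which is why the latency cost of the deeper queue stays negligible.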
The networking stack is complicated (what is simple in kernel space?). It is certainly not an opaque black box, but rather an orchestrated set of different pieces of knowledge, reflected in layers that can sometimes make incompatible choices. As a lesson, we learned that the relationship between latency and throughput is not the same across different technologies, and that aggregation in wireless technologies is more common than we initially thought. Moreover, as a community, we should start thinking about automated tests that can give an idea of the performance impact of a patch across different technologies and in a wide range of contexts, from the 40Gb/s device in a burdened server to the 802.11a/b/g/n USB dongle connected to a Raspberry Pi.
[The authors would like to thank Toke Høiland-Jørgensen for his support and for the time he dedicated to the Flent tool, to the WiFi drivers, and to gathering the results from the ath9k and ath10k drivers.]
Index entries for this article
Kernel: Networking/Wireless
GuestArticles: Augusto Grazia, Carlo
Posted Jun 19, 2018 3:35 UTC (Tue)
by pabs (subscriber, #43278)
[Link] (7 responses)
Posted Jun 19, 2018 6:56 UTC (Tue)
by Beolach (guest, #77384)
[Link] (5 responses)
Sadly, to the best of my knowledge there are no 802.11ac chipsets w/ both open-source drivers & firmware - I believe even the ath10k 802.11ac chipsets have closed firmware blobs. :-( I would love to hear about any I've missed.
Posted Jun 21, 2018 15:59 UTC (Thu)
by mtaht (subscriber, #11087)
[Link] (4 responses)
The host cpu chip is multicore and much faster than anything built into the wifi chip, but wifi has some really hard real-time constraints that require dedicated cpus onboard, and once you start doing that the temptation to wedge all your functionality there has thus far been overwhelming for vendors.
So far as I know, Qualcomm's 802.11ac devices use a proprietary real-time OS inside.
Some chipsets (like quantenna's) actually wedge an entire linux stack into their chip.
Qualcomm's LTE modems do also.
Posted Jun 21, 2018 16:43 UTC (Thu)
by excors (subscriber, #95769)
[Link] (1 responses)
And portability - you don't want to maintain multiple separate copies of your million lines of driver code (plus regression tests etc) for Linux, Windows, Apple, Fuchsia, the several obsolete Linux versions your customers still use, etc, when it's much easier to put almost all the code in firmware so the OS-specific driver is just a thin wrapper. Plus the Linux community will likely be much happier if you upstream that wrapper and leave the firmware opaque, than if you attempt to upstream your million lines of cross-platform-ish code that doesn't follow the Linux coding style and has an ugly abstraction layer for OS-specific bits.
None of that stops you making the thick firmware open source, though.
Posted Jun 21, 2018 19:05 UTC (Thu)
by mtaht (subscriber, #11087)
[Link]
If I had any one wish for "smart firmware", it would be only that there was enough smarts in the wifi and lte hardware/firmware to handle no more than 4ms worth of queuing and real time processing, and let the kernel handle the rest within its constraints for interrupt latency.
BQL accomplished this for ethernet (sub-1ms there, actually)
fq_codel for wifi ( https://www.usenix.org/system/files/conference/atc17/atc1... ) gets it down to two aggregates, which can take up to ~5ms each, at any achieved "line" rate.
We can do better than this with better control of txops on the AP, and certainly the algorithms above can exist, offloaded, in smarter hardware, which is happening on several chipsets I'm aware of.
PS The new sch_cake actually can run ethernet, shaped to 1gbit, at lower latencies than anything that uses BQL - at a large cost in cpu overhead, but not as much as you might think - it works at that rate on a quad core atom, for example.
http://www.taht.net/~d/cake/rrul_be_-_cake-shaped-gbit-qu...
vs sch_fq:
http://www.taht.net/~d/cake/rrul_be_-_fq-quad-long-smooth...
The principal advantage of cake here, even though it is now capable of running at speeds and latencies like this, is to defeat other black-box token-bucket shapers on a link at much lower rates, with a corresponding reduction in cpu cost to (at sub-100mbit) the level of "noise".
Posted Jun 22, 2018 9:52 UTC (Fri)
by kronat (guest, #117266)
[Link] (1 responses)
> Qualcomm's LTE modems do also.

Do you have a reference for these statements? I am investigating a similar problem in 3GPP networks. Thanks!
Posted Jun 22, 2018 23:40 UTC (Fri)
by mtaht (subscriber, #11087)
[Link]
But: https://osmocom.org/projects/quectel-modems/wiki
and the slides from that talk: https://fahrplan.events.ccc.de/congress/2016/Fahrplan/sys...
Posted Jun 22, 2018 8:12 UTC (Fri)
by cagrazia (guest, #124754)
[Link]
The exact chipsets we used in our tests are the Atheros AR9271 (ath9k_htc driver), the Atheros AR9580 (ath9k), and the Atheros QCA9880v2 (ath10k). Due to space constraints, we presented here only the results from the ath9k_htc device, but we had a similarly positive outcome with all three chipsets.
Posted Jun 19, 2018 8:04 UTC (Tue)
by shiftee (subscriber, #110711)
[Link]
Dongles based on this chipset are available:
https://www.thinkpenguin.com/gnu-linux/penguin-wireless-n...
and
https://www.olimex.com/Products/USB-Modules/MOD-WIFI-AR92...

If I remember correctly, there was a FOSS enthusiast working for Atheros who convinced them to release the firmware code, but he has since left the company.
Posted Jun 19, 2018 13:29 UTC (Tue)
by johan (guest, #112044)
[Link]

If they had they would likely have found issues like these a bit earlier.
Obviously it's a very hard task though, considering how many devices there are to test.
Posted Jun 19, 2018 17:50 UTC (Tue)
by josh (subscriber, #17465)
[Link] (1 responses)
I'd be curious how fast UDP on the other operating system went, to know if it topped out at the same 100Mb/s.
Posted Jun 22, 2018 8:27 UTC (Fri)
by cagrazia (guest, #124754)
[Link]
We did a very quick TCP and UDP test on a Windows 8 machine, with a kind of iperf tool (honestly, it was tricky and messy to configure): throughput results oscillated between 90 and 95 Mb/s, while the TCP RTT was very unstable, as was the UDP jitter.
Posted Jun 21, 2018 14:47 UTC (Thu)
by mtaht (subscriber, #11087)
[Link]
From eric dumazet:

On 06/21/2018 02:22 AM, Toke Høiland-Jørgensen wrote:
> Dave Taht <dave.taht@gmail.com> writes:
>
>> Nice war story. I'm glad this last problem with the fq_codel wifi code
>> is solved
>
> This wasn't specific to the fq_codel wifi code, but hit all WiFi devices
> that were running TCP on the local stack. Which would be mostly laptops,
> I guess...

Yes.

Also switching TCP stack to always GSO has been a major gain for wifi in my tests.
(TSQ budget is based on sk_wmem_alloc, tracking truesize of skbs, and not having
GSO is considerably inflating the truesize/payload ratio)

tcp: switch to GSO being always on
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/...

I expect SACK compression to also give a nice boost to wifi.

tcp: add SACK compression
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/...

Lastly I am working on adding ACK compression in TCP stack itself.
Posted Jun 23, 2018 18:16 UTC (Sat)
by meuh (guest, #22042)
[Link] (5 responses)
Posted Jun 25, 2018 9:16 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (4 responses)
Nope; the answer is IEEE 802.3 Ethernet. WiFi (IEEE 802.11) is designed to transparently interoperate with 802.3 Ethernets. The IEEE has declared that the Ethernet MTU is fixed at 1500 bytes[1]; this implies that WiFi per-frame MTUs are also fixed at 1500 bytes. Given that it is a hard requirement for WiFi that the frame MTU is no more than 1500 bytes, you need things like aggregation to get a decent speed.
If larger frames were permitted on 802.11, then you would not be able to bridge 802.11 with IEEE standard 802.3; while it's common to support jumbo frames on Ethernet, this is technically a non-standard extension, and IEEE standard 802.11 can't assume that any Ethernet it is connected to will permit jumbo frames.
[1] While the IEEE 802.3 MTU is 1500 bytes, they also now require all equipment to handle frames of up to 2000 bytes in total size, to allow for headers, checksums, VLAN tags etc. WiFi is similar - 2304 byte maximum MSDU frame, of which 1500 bytes maximum is user MTU, and the other 804 bytes are reserved for VLAN tags etc.
Posted Jun 25, 2018 16:01 UTC (Mon)
by raven667 (subscriber, #5198)
[Link] (3 responses)
Posted Jun 25, 2018 16:20 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (2 responses)
Nope. The issue is that you cannot have an MTU above 1500 on Ethernet without breaking the IEEE specs for Ethernet and for WiFi. You are simply not allowed a jumbo MTU on the Layer 2 link, and the IEEE won't accept changes to 802 series standards that increase the user MTU beyond 1500.
IPv6 is not relevant here - it's an IEEE decision because even in IPv4, with router fragmentation allowed, the IEEE doesn't like it.
Posted Jun 25, 2018 17:28 UTC (Mon)
by raven667 (subscriber, #5198)
[Link] (1 responses)
Posted Jun 25, 2018 17:40 UTC (Mon)
by farnz (subscriber, #17727)
[Link]
You do, but they're not using IEEE standard Ethernet (jumbo frames implies not IEEE standard) - and WiFi standards (including frame aggregation) are written to get high performance when using IEEE standard Ethernet.
Hence frame aggregation rather than high MTUs - a high MTU for performance means being outside the IEEE standard, while a 1500 MTU allows you to be inside the standard.