Network transmit queue limits
One assumes that the number of Warcraft players on the Google campus is relatively small, but Google worries about latency anyway. Anything that slows response times makes Google's services less attractive. So it is not surprising that we have seen various latency-reducing changes from Google, including the increase in the initial congestion window merged for 2.6.38. A more recent patch from Google's Tom Herbert attacks latency caused by excessive buffering, but its future in its current form is uncertain.
An outgoing packet may pass through several layers of buffering before it hits the wire for even the first hop. There may be queues within the originating application, in the network protocol code, in the traffic control policy layers, in the device driver, and in the device itself - and probably in several other places as well. A full solution to the buffering problem will likely require addressing all of these issues, but each layer will have its own concerns and will be a unique problem to solve. Tom's patch is aimed at the last step in the system - buffering within the device's internal transmit queue.
Any worthwhile network interface will support a ring of descriptors describing packets which are waiting to be transmitted. If the interface is busy, there should always be some packets buffered there; once the transmission of one packet is complete, the interface should be able to begin the next one without waiting for the kernel to respond. It makes little sense, though, to buffer more packets in the device than is necessary to keep the transmitter busy; anything more than that will just add latency. Thus far, little thought has gone into how big that buffer should be; the default is often too large. On your editor's system, ethtool says that the length of the transmit ring is 256 packets; on a 1G Ethernet, with 1500-byte packets, that ring would take just over 3ms to transmit completely. 3ms is a fair amount of latency to add to a local transmission, and it's only one of several possible sources of latency. It may well make sense to make that buffer smaller.
The problem, of course, is that the ideal buffer size varies considerably from one system - and one workload - to the next. A lightly-loaded system sending large packets can get by with a small number of buffered packets. If the system is heavily loaded, more time may pass before the transmit queue can be refilled, so that queue should be larger. If the packets being transmitted are small, it will be necessary to buffer more of them. A few moments spent thinking about the problem will make it clear that (1) the number of packets is the wrong parameter to use for the size of the queue, and (2) the queue length must be a dynamic parameter that responds to the current load on the system. Expecting system administrators to tweak transmit queue lengths manually seems like a losing strategy.
Tom's patch adds a new "dynamic queue limits" (DQL) library that is meant to be a general-purpose queue length controller; on top of that he builds the "byte queue limits" mechanism used within the networking layer. One of the key observations is that the limit should be expressed in bytes rather than packets, since the number of queued bytes more accurately approximates the time required to empty the queue. To use this code, drivers must, when queueing packets to the interface, make a call to one of:
    void netdev_sent_queue(struct net_device *dev, unsigned int pkts, unsigned int bytes);
    void netdev_tx_sent_queue(struct netdev_queue *dev_queue, unsigned int pkts, unsigned int bytes);
Either of these functions will note that the given number of bytes have been queued to the given device. If the underlying DQL code determines that the queue is long enough after adding these bytes, it will tell the upper layers to pass no more data to the device for now.
When a transmission completes, the driver should call one of:
    void netdev_completed_queue(struct net_device *dev, unsigned pkts, unsigned bytes);
    void netdev_tx_completed_queue(struct netdev_queue *dev_queue, unsigned pkts, unsigned bytes);
The DQL library will respond by reenabling the flow of packets into the driver if the length of the queue has fallen far enough.
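To make the driver side concrete, the following sketch shows how a hypothetical driver might wire up these calls, using the interface as described above; it is not taken from Tom's patch, and the helpers my_priv, ring_add() and ring_clean_one() are invented stand-ins for a real driver's descriptor-ring management.

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    struct my_priv;                           /* hypothetical driver state */
    static void ring_add(struct my_priv *priv, struct sk_buff *skb);
    static struct sk_buff *ring_clean_one(struct my_priv *priv);

    static netdev_tx_t my_start_xmit(struct sk_buff *skb, struct net_device *dev)
    {
        struct my_priv *priv = netdev_priv(dev);

        ring_add(priv, skb);                  /* post the descriptor */
        /* Tell the DQL layer how much was just queued; it decides on its
         * own whether the stack should be told to stop sending. */
        netdev_sent_queue(dev, 1, skb->len);
        return NETDEV_TX_OK;
    }

    /* Called from the transmit-completion (cleanup) path. */
    static void my_tx_clean(struct net_device *dev)
    {
        struct my_priv *priv = netdev_priv(dev);
        unsigned int pkts = 0, bytes = 0;
        struct sk_buff *skb;

        while ((skb = ring_clean_one(priv)) != NULL) {
            pkts++;
            bytes += skb->len;
            dev_kfree_skb_any(skb);
        }
        /* Report the completions; the DQL code restarts the flow of
         * packets once the backlog has drained far enough. */
        netdev_completed_queue(dev, pkts, bytes);
    }

Note that all of the policy lives behind the two calls; the driver only reports what it queued and what has completed.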
In the completion routine, the DQL code also occasionally tries to adjust the queue length for optimal performance. If the queue becomes empty while transmission has been turned off in the networking code, the queue is clearly too short - there was not time to get more packets into the stream before the transmitter came up dry. On the other hand, if the queue length never goes below a given number of bytes, the maximum length can probably be reduced by up to that many bytes. Over time, it is hoped that this algorithm will settle on a reasonable length and that it will be able to respond if the situation changes and a different length is called for.
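As a rough illustration of that heuristic - not the actual DQL algorithm, whose bookkeeping is more careful - a byte-limit controller along these lines might look like the following; the dql_sketch structure, its fields, and the growth factor are all invented for the example, and min_backlog is assumed to start at UINT_MAX.

    #include <limits.h>
    #include <stdbool.h>

    struct dql_sketch {
        unsigned int limit;        /* current byte limit */
        unsigned int queued;       /* bytes handed to the device so far */
        unsigned int completed;    /* bytes the device has finished sending */
        unsigned int min_backlog;  /* smallest backlog seen since last adjustment */
        bool stopped;              /* upper layers currently held off */
    };

    /* Enqueue side: account for the bytes and report whether the stack
     * should be told to stop sending for now. */
    static bool dql_sketch_queued(struct dql_sketch *dql, unsigned int bytes)
    {
        dql->queued += bytes;
        dql->stopped = (dql->queued - dql->completed) >= dql->limit;
        return dql->stopped;
    }

    /* Completion side: account for finished bytes and adjust the limit. */
    static void dql_sketch_completed(struct dql_sketch *dql, unsigned int bytes)
    {
        unsigned int backlog;

        dql->completed += bytes;
        backlog = dql->queued - dql->completed;

        if (backlog == 0 && dql->stopped) {
            /* The device ran dry while the stack was held off:
             * the limit is too small, so grow it. */
            dql->limit += dql->limit / 2 + 1;
            dql->min_backlog = UINT_MAX;
        } else if (backlog < dql->min_backlog) {
            /* Bytes that are always left in the queue are pure latency;
             * remember the smallest backlog seen. */
            dql->min_backlog = backlog;
        }

        if (backlog < dql->limit)
            dql->stopped = false;  /* let traffic flow again */
    }

    /* Called occasionally: if some part of the queue was never drained,
     * the limit can be trimmed by that amount. */
    static void dql_sketch_adjust(struct dql_sketch *dql)
    {
        if (dql->min_backlog != UINT_MAX && dql->min_backlog > 0)
            dql->limit -= dql->min_backlog;
        dql->min_backlog = UINT_MAX;
    }

The important property is that both growth and shrinkage are driven by what the completion path actually observes, so no administrator tuning is required.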
The idea behind this patch makes sense, so nobody spoke out against it. Stephen Hemminger did express concerns about the need to add explicit calls to drivers to make it all work, though. The API for network drivers is already complex; he would like to avoid making it more so if possible. Stephen thinks that it should be possible to watch traffic flowing through the device at the higher levels and control the queue length without any knowledge or cooperation from the driver at all; Tom is not yet convinced that this will work. It will probably take some time to figure out what the best solution is, and the code could end up changing significantly before we see dynamic transmit queue length control get into the mainline.
Index entries for this article: Kernel/Networking
Posted Aug 12, 2011 12:09 UTC (Fri) by ajb (guest, #9694) (5 responses)
Posted Aug 12, 2011 16:55 UTC (Fri) by dlang (guest, #313) (4 responses)
if you have a full-duplex connection (i.e. hard-wired ethernet on modern switches), bytes and time have a very close correlation.
if you are on a shared media connection (unfortunately including all radio-based systems), then the correlation is not as close, because you can't know ahead of time how long it will take to send the data (you have to wait for other systems, retry, etc.)
I think bytes is as accurate as you are going to be able to get.
Posted Aug 12, 2011 17:25 UTC (Fri) by ajb (guest, #9694) (1 responses)
I was thinking of something along the lines of:

    void q_add(Q *q, PKT *pkt)
    {
        // timestamp packet
        pkt->time = now();
        // add packet to end of list
        pkt->next = 0;
        *q->last = pkt;
        q->last = &pkt->next;
    }

    PKT *q_get(Q *q)
    {
        // take the packet at the head of the list, if any
        PKT *pkt = q->first;
        if (!pkt)
            return 0;
        q->first = pkt->next;
        if (!q->first)
            q->last = &q->first;
        // drop the packet if it has waited longer than max_time
        if ((pkt->time + q->max_time) < now())
        {
            free(pkt);
            return 0;
        }
        return pkt;
    }

No estimation at all. There are weaknesses in this approach, but it's simpler than adjusting a byte length.
Posted Aug 13, 2011 7:40 UTC (Sat) by butlerm (subscriber, #13312)
Posted Aug 24, 2011 15:03 UTC (Wed) by wtanksleyjr (subscriber, #74601) (1 responses)
The sources of the variance will have to be considered carefully; variance caused by time delays on the output is probably different from that caused by multiple clients asynchronously loading data into the input.
Posted Aug 12, 2013 4:42 UTC (Mon) by shentino (guest, #76459)
Especially if the kernel is busy with something else and can't immediately handle an interrupt.
And since the queue is being digested by the hardware itself in many cases, the kernel can't just tinker with it willy nilly.
Posted Aug 13, 2011 5:15 UTC (Sat) by sfink (guest, #6405) (1 responses)
But I don't know anything about what's involved so this probably isn't a realistic solution.
Posted Aug 13, 2011 5:27 UTC (Sat) by dlang (guest, #313)
Posted Aug 13, 2011 7:36 UTC (Sat) by butlerm (subscriber, #13312) (7 responses)
According to the linked article, the patch which was merged in 2.6.38 increases the initial receive window, not the initial congestion window. A patch increasing the initial congestion window would be the sort of thing the IETF would frown upon - without their blessing, of course.
Initial congestion window
Posted Aug 13, 2011 14:18 UTC (Sat) by corbet (editor, #1) (2 responses)
No, it's the initial congestion window; I'm not quite sure where this comes from. And yes it went through a long process with the IETF first.
In the wild
Posted Aug 13, 2011 14:51 UTC (Sat) by dmarti (subscriber, #11625)
If you're using Google or Microsoft web sites, you're probably also testing this: Google and Microsoft Cheat on Slow-Start. Should You?
Initial congestion window
Posted Aug 14, 2011 0:11 UTC (Sun) by butlerm (subscriber, #13312)
Sorry for creating any confusion. I see on git.kernel.org that both patches have made it in, which is good news. However, I believe that the increase to the initial congestion window is still a draft, not an RFC.
Posted Aug 15, 2011 14:22 UTC (Mon) by nye (guest, #51576) (3 responses)
I can't wrap my head around slow-start, probably because I don't think I understand the problem it's intended to solve.
What I'm wondering is: since the sender already has an upper bound for the min-RTT, why is the initial congestion window set to a fixed number rather than to "the number of segments that can be transmitted in the RTT" (or the receiver's advertised window if smaller)?
I guess this wouldn't work for high-latency congested links since the initial window is IIUC used as the *minimum* window to fall back to when congestion occurs, but why does that need to be the case? I suspect the answer to this question may be along the lines of "that's the point of slow-start", but it's not intuitive to me.
If anyone knows of any resources which explain this problem from "first principles" - ie. without requiring the reader to already have more than a passing familiarity with TCP - I'd appreciate a pointer.
Posted Aug 16, 2011 22:19 UTC (Tue) by jch (guest, #51929) (2 responses)
I don't have my library handy, but I seem to recall that Tanenbaum discusses TCP congestion control at length. I'm sure you'll find something good in Stevens too.
> I can't wrap my head around slow-start, probably because I don't think I understand the problem it's intended to solve.
I'll make the bold claim that nobody fully understands the dynamics of TCP.
I wasn't there (so I'm probably wrong), but I believe that slow-start was designed as a fairly naive mechanism because it was not supposed to matter much in practice. TCP connections were supposed to be either long-lived bulk transfers (FTP, say), or interactive flows (telnet, or the conversational phase of SMTP). In the first case, slow-start only happens at the beginning of the transfer, which is a negligible part of the connection, while in the second case the size of the congestion window doesn't matter.
The trouble is with HTTP, which causes a lot of short-lived connections. Such a short-lived connection spends most or all of its life in slow-start. Hence the need for sharing state between different connections (which Linux does AFAIR) or tweaking the initial window.
> since the sender already has an upper bound for the min-RTT, why is the initial congestion window set to a fixed number rather than to "the number of segments that can be transmitted in the RTT"
Recall that the congestion window is there to limit congestion: it should decrease as congestion increases. With typical queueing techniques, the RTT increases with congestion, so what you are suggesting has the opposite of the desired dynamics.
Yeah, it's tricky. No, I don't claim to understand the trade-offs involved.
--jch
Posted Aug 17, 2011 12:05 UTC (Wed) by nye (guest, #51576)
Thanks for the reference. I don't know Stevens - I assume you're talking about TCP/IP illustrated? I notice there's a second edition due out later this year. Sadly not in paperback though; can't stand hardbacks so I'll probably give it a miss.
>> since the sender already has an upper bound for the min-RTT, why is the initial congestion window set to a fixed number rather than to "the number of segments that can be transmitted in the RTT"
> Recall that the congestion window is there to limit congestion: it should decrease as congestion increases. With typical queueing techniques, the RTT increases with congestion, so what you are suggesting has the opposite of the desired dynamics.
Sorry, I should have said "the number of segments that can be transmitted in the *minimum* RTT", and then only as the *initial* cwnd. The thinking being that it can't possibly have received an ACK yet, so the fact that it hasn't need not imply congestion. I haven't really thought through the implications of that in the case that the 3-way handshake is made under highly congested conditions though, giving a vastly inaccurate bound for the min-RTT
>I wasn't there (so I'm probably wrong), but I believe that slow-start was designed as a fairly naive mechanism because it was not supposed to matter much in practice. TCP connections were supposed to be either long-lived bulk transfers (FTP, say), or interactive flows
This is interesting, from the point of view of how we're predominantly using a protocol for something a little out of its design parameters.
(I was going to go off on a tangent here about using TCP/IP in circumstances which break its design assumptions, like bufferbloat and highly asymmetrical connections, but I need to think about it some more)
Posted Aug 17, 2011 16:20 UTC (Wed) by butlerm (subscriber, #13312)
It is worth keeping in mind that slow start is not very slow - it is a doubling of the congestion window (and hence average transmit bandwidth) every round trip time. If you don't have something like slow start a new connection tends to immediately saturate every bottleneck link, causing large scale packet loss not only on the new connection, but all the others using the link as well.
That puts all the (congestion controlled) flows on the link into some sort of recovery mode, which is generally much slower than slow start is in the first place - a constant increase every RTT rather than a multiplicative one.
It works, the flows do sort themselves out, but it isn't very friendly, and usually doesn't even help the new connection. That is why they adopted "slow" start in the first place. It replaced previous practice of saturating the outbound link until some sort of loss indication was received. Running a gigabit per second flow into a ten megabit per second link doesn't work all that well.
Posted Aug 15, 2011 11:43 UTC (Mon) by jch (guest, #51929) (3 responses)
This doesn't decrease latency -- it increases throughput for short-lived connections ("mice"). Quite the opposite, in underprovisioned networks with a lot of mice it could increase latency dramatically.
--jch
Posted Aug 15, 2011 13:45 UTC (Mon) by corbet (editor, #1) (2 responses)
You're talking about the congestion window change? That's very much about latency. It lets pages load more quickly without the need to open lots of independent connections; the associated documentation is very clear on the motivation.
Posted Aug 16, 2011 21:45 UTC (Tue) by jch (guest, #51929) (1 responses)
That's just the kind of tricky trade-off that the bufferbloat project is struggling with.
--jch
Posted Aug 17, 2011 16:02 UTC (Wed) by butlerm (subscriber, #13312)
There seem to me to be only two good ways to solve the queuing latency problem, beyond simply reducing queuing limits on bottleneck routers and interfaces to reasonable sizes. One is the widespread deployment of packet pacing, which is difficult to do well without hardware support, and which has other challenges. The other is fair (flow-specific) queuing at every bottleneck router or interface. The latter seems much more practical to me.