Network transmit queue limits
One assumes that the number of Warcraft players on the Google campus is relatively small, but Google worries about latency anyway. Anything that slows response times makes Google's services less attractive. So it is not surprising that we have seen various latency-reducing changes from Google, including the increase in the initial congestion window merged for 2.6.38. A more recent patch from Google's Tom Herbert attacks latency caused by excessive buffering, but its future in its current form is uncertain.
An outgoing packet may pass through several layers of buffering before it hits the wire for even the first hop. There may be queues within the originating application, in the network protocol code, in the traffic control policy layers, in the device driver, and in the device itself - and probably in several other places as well. A full solution to the buffering problem will likely require addressing all of these issues, but each layer will have its own concerns and will be a unique problem to solve. Tom's patch is aimed at the last step in the system - buffering within the device's internal transmit queue.
Any worthwhile network interface will support a ring of descriptors describing packets which are waiting to be transmitted. If the interface is busy, there should always be some packets buffered there; once the transmission of one packet is complete, the interface should be able to begin the next one without waiting for the kernel to respond. It makes little sense, though, to buffer more packets in the device than is necessary to keep the transmitter busy; anything more than that will just add latency. Thus far, little thought has gone into how big that buffer should be; the default is often too large. On your editor's system, ethtool says that the length of the transmit ring is 256 packets; on a 1G Ethernet, with 1500-byte packets, that ring would take just over 3ms to transmit completely. 3ms is a fair amount of latency to add to a local transmission, and it's only one of several possible sources of latency. It may well make sense to make that buffer smaller.
The problem, of course, is that the ideal buffer size varies considerably from one system - and one workload - to the next. A lightly-loaded system sending large packets can get by with a small number of buffered packets. If the system is heavily loaded, more time may pass before the transmit queue can be refilled, so that queue should be larger. If the packets being transmitted are small, it will be necessary to buffer more of them. A few moments spent thinking about the problem will make it clear that (1) the number of packets is the wrong parameter to use for the size of the queue, and (2) the queue length must be a dynamic parameter that responds to the current load on the system. Expecting system administrators to tweak transmit queue lengths manually seems like a losing strategy.
Tom's patch adds a new "dynamic queue limits" (DQL) library that is meant to be a general-purpose queue length controller; on top of that he builds the "byte queue limits" mechanism used within the networking layer. One of the key observations is that the limit should be expressed in bytes rather than packets, since the number of queued bytes more accurately approximates the time required to empty the queue. To use this code, drivers must, when queueing packets to the interface, make a call to one of:
    void netdev_sent_queue(struct net_device *dev, unsigned int pkts, unsigned int bytes);
    void netdev_tx_sent_queue(struct netdev_queue *dev_queue, unsigned int pkts, unsigned int bytes);
Either of these functions will note that the given number of bytes have been queued to the given device. If the underlying DQL code determines that the queue is long enough after adding these bytes, it will tell the upper layers to pass no more data to the device for now.
When a transmission completes, the driver should call one of:
    void netdev_completed_queue(struct net_device *dev, unsigned pkts, unsigned bytes);
    void netdev_tx_completed_queue(struct netdev_queue *dev_queue, unsigned pkts, unsigned bytes);
The DQL library will respond by reenabling the flow of packets into the driver if the length of the queue has fallen far enough.
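To make the driver side concrete, the following sketch shows how a hypothetical driver might wire up these calls, using the interface as described above; it is not taken from Tom's patch, and the helpers my_priv, ring_add() and ring_clean_one() are invented stand-ins for a real driver's descriptor-ring management.

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    struct my_priv;                           /* hypothetical driver state */
    static void ring_add(struct my_priv *priv, struct sk_buff *skb);
    static struct sk_buff *ring_clean_one(struct my_priv *priv);

    static netdev_tx_t my_start_xmit(struct sk_buff *skb, struct net_device *dev)
    {
        struct my_priv *priv = netdev_priv(dev);

        ring_add(priv, skb);                  /* post the descriptor */
        /* Tell the DQL layer how much was just queued; it decides on its
         * own whether the stack should be told to stop sending. */
        netdev_sent_queue(dev, 1, skb->len);
        return NETDEV_TX_OK;
    }

    /* Called from the transmit-completion (cleanup) path. */
    static void my_tx_clean(struct net_device *dev)
    {
        struct my_priv *priv = netdev_priv(dev);
        unsigned int pkts = 0, bytes = 0;
        struct sk_buff *skb;

        while ((skb = ring_clean_one(priv)) != NULL) {
            pkts++;
            bytes += skb->len;
            dev_kfree_skb_any(skb);
        }
        /* Report the completions; the DQL code restarts the flow of
         * packets once the backlog has drained far enough. */
        netdev_completed_queue(dev, pkts, bytes);
    }

Note that all of the policy lives behind the two calls; the driver only reports what it queued and what has completed.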
In the completion routine, the DQL code also occasionally tries to adjust the queue length for optimal performance. If the queue becomes empty while transmission has been turned off in the networking code, the queue is clearly too short - there was not time to get more packets into the stream before the transmitter came up dry. On the other hand, if the queue length never goes below a given number of bytes, the maximum length can probably be reduced by up to that many bytes. Over time, it is hoped that this algorithm will settle on a reasonable length and that it will be able to respond if the situation changes and a different length is called for.
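As a rough illustration of that heuristic - not the actual DQL algorithm, whose bookkeeping is more careful - a byte-limit controller along these lines might look like the following; the dql_sketch structure, its fields, and the growth factor are all invented for the example, and min_backlog is assumed to start at UINT_MAX.

    #include <limits.h>
    #include <stdbool.h>

    struct dql_sketch {
        unsigned int limit;        /* current byte limit */
        unsigned int queued;       /* bytes handed to the device so far */
        unsigned int completed;    /* bytes the device has finished sending */
        unsigned int min_backlog;  /* smallest backlog seen since last adjustment */
        bool stopped;              /* upper layers currently held off */
    };

    /* Enqueue side: account for the bytes and report whether the stack
     * should be told to stop sending for now. */
    static bool dql_sketch_queued(struct dql_sketch *dql, unsigned int bytes)
    {
        dql->queued += bytes;
        dql->stopped = (dql->queued - dql->completed) >= dql->limit;
        return dql->stopped;
    }

    /* Completion side: account for finished bytes and adjust the limit. */
    static void dql_sketch_completed(struct dql_sketch *dql, unsigned int bytes)
    {
        unsigned int backlog;

        dql->completed += bytes;
        backlog = dql->queued - dql->completed;

        if (backlog == 0 && dql->stopped) {
            /* The device ran dry while the stack was held off:
             * the limit is too small, so grow it. */
            dql->limit += dql->limit / 2 + 1;
            dql->min_backlog = UINT_MAX;
        } else if (backlog < dql->min_backlog) {
            /* Bytes that are always left in the queue are pure latency;
             * remember the smallest backlog seen. */
            dql->min_backlog = backlog;
        }

        if (backlog < dql->limit)
            dql->stopped = false;  /* let traffic flow again */
    }

    /* Called occasionally: if some part of the queue was never drained,
     * the limit can be trimmed by that amount. */
    static void dql_sketch_adjust(struct dql_sketch *dql)
    {
        if (dql->min_backlog != UINT_MAX && dql->min_backlog > 0)
            dql->limit -= dql->min_backlog;
        dql->min_backlog = UINT_MAX;
    }

The important property is that both growth and shrinkage are driven by what the completion path actually observes, so no administrator tuning is required.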
The idea behind this patch makes sense, so nobody spoke out against it. Stephen Hemminger did express concerns about the need to add explicit calls to drivers to make it all work, though. The API for network drivers is already complex; he would like to avoid making it more so if possible. Stephen thinks that it should be possible to watch traffic flowing through the device at the higher levels and control the queue length without any knowledge or cooperation from the driver at all; Tom is not yet convinced that this will work. It will probably take some time to figure out what the best solution is, and the code could end up changing significantly before we see dynamic transmit queue length control get into the mainline.
Index entries for this article: Kernel/Networking
Posted Aug 12, 2011 12:09 UTC (Fri) by ajb (guest, #9694) (5 responses)
Posted Aug 12, 2011 16:55 UTC (Fri) by dlang (guest, #313) (4 responses)
if you have a full-duplex connection (i.e. hard-wired ethernet on modern switches), bytes and time have a very close correlation.
if you are on a shared media connection (unfortunately including all radio-based systems), then the correlation is not as close, because you can't know ahead of time how long it will take to send the data (you have to wait for other systems, retry, etc.)
I think bytes is as accurate as you are going to be able to get.
Posted Aug 12, 2011 17:25 UTC (Fri) by ajb (guest, #9694) (1 responses)
I was thinking of something along the lines of:

    void q_add(Q *q, PKT *pkt)
    {
        // timestamp packet
        pkt->time = now();
        // add packet to end of list
        pkt->next = 0;
        *q->last = pkt;
        q->last = &pkt->next;
    }

    PKT *q_get(Q *q)
    {
        // take the packet at the head of the list, if any
        PKT *pkt = q->first;
        if (!pkt)
            return 0;
        q->first = pkt->next;
        if (!q->first)
            q->last = &q->first;
        // drop the packet if it has waited longer than max_time
        if ((pkt->time + q->max_time) < now())
        {
            free(pkt);
            return 0;
        }
        return pkt;
    }

No estimation at all. There are weaknesses in this approach, but it's simpler than adjusting a byte length.
Posted Aug 13, 2011 7:40 UTC (Sat) by butlerm (subscriber, #13312)
Posted Aug 24, 2011 15:03 UTC (Wed) by wtanksleyjr (subscriber, #74601) (1 responses)
The sources of the variance will have to be considered carefully; variance caused by time delays on the output is probably different from that caused by multiple clients asynchronously loading data into the input.
Posted Aug 12, 2013 4:42 UTC (Mon) by shentino (guest, #76459)
Especially if the kernel is busy with something else and can't immediately handle an interrupt.
And since the queue is being digested by the hardware itself in many cases, the kernel can't just tinker with it willy nilly.
Posted Aug 13, 2011 5:15 UTC (Sat) by sfink (guest, #6405) (1 responses)
But I don't know anything about what's involved so this probably isn't a realistic solution.
Posted Aug 13, 2011 5:27 UTC (Sat) by dlang (guest, #313)
Posted Aug 13, 2011 7:36 UTC (Sat) by butlerm (subscriber, #13312) (7 responses)
According to the linked article, the patch which was merged in 2.6.38 increases the initial receive window, not the initial congestion window. A patch increasing the initial congestion window would be the sort of thing the IETF would frown upon - without their blessing, of course.
Initial congestion window
Posted Aug 13, 2011 14:18 UTC (Sat) by corbet (editor, #1) (2 responses)
No, it's the initial congestion window; I'm not quite sure where this comes from. And yes it went through a long process with the IETF first.
In the wild
Posted Aug 13, 2011 14:51 UTC (Sat) by dmarti (subscriber, #11625)
If you're using Google or Microsoft web sites, you're probably also testing this: Google and Microsoft Cheat on Slow-Start. Should You?
Initial congestion window
Posted Aug 14, 2011 0:11 UTC (Sun) by butlerm (subscriber, #13312)
Sorry for creating any confusion. I see on git.kernel.org that both patches have made it in, which is good news. However, I believe that the increase to the initial congestion window is still a draft, not an RFC.
Posted Aug 15, 2011 14:22 UTC (Mon) by nye (guest, #51576) (3 responses)
I can't wrap my head around slow-start, probably because I don't think I understand the problem it's intended to solve.
What I'm wondering is: since the sender already has an upper bound for the min-RTT, why is the initial congestion window set to a fixed number rather than to "the number of segments that can be transmitted in the RTT" (or the receiver's advertised window if smaller)?
I guess this wouldn't work for high-latency congested links since the initial window is IIUC used as the *minimum* window to fall back to when congestion occurs, but why does that need to be the case? I suspect the answer to this question may be along the lines of "that's the point of slow-start", but it's not intuitive to me.
If anyone knows of any resources which explain this problem from "first principles" - ie. without requiring the reader to already have more than a passing familiarity with TCP - I'd appreciate a pointer.
Posted Aug 16, 2011 22:19 UTC (Tue) by jch (guest, #51929) (2 responses)
I don't have my library handy, but I seem to recall that Tanenbaum discusses TCP congestion control at length. I'm sure you'll find something good in Stevens too.
> I can't wrap my head around slow-start, probably because I don't think I understand the problem it's intended to solve.
I'll make the bold claim that nobody fully understands the dynamics of TCP.
I wasn't there (so I'm probably wrong), but I believe that slow-start was designed as a fairly naive mechanism because it was not supposed to matter much in practice. TCP connections were supposed to be either long-lived bulk transfers (FTP, say), or interactive flows (telnet, or the conversational phase of SMTP). In the first case, slow-start only happens at the beginning of the transfer, which is a negligible part of the connection, while in the second case the size of the congestion window doesn't matter.
The trouble is with HTTP, which causes a lot of short-lived connections. Such a short-lived connection spends most or all of its life in slow-start. Hence the need for sharing state between different connections (which Linux does AFAIR) or tweaking the initial window.
> since the sender already has an upper bound for the min-RTT, why is the initial congestion window set to a fixed number rather than to "the number of segments that can be transmitted in the RTT"
Recall that the congestion window is there to limit congestion: it should decrease as congestion increases. With typical queueing techniques, the RTT increases with congestion, so what you are suggesting has the opposite of the desired dynamics.
Yeah, it's tricky. No, I don't claim to understand the trade-offs involved.
--jch
Posted Aug 17, 2011 12:05 UTC (Wed) by nye (guest, #51576)
Thanks for the reference. I don't know Stevens - I assume you're talking about TCP/IP illustrated? I notice there's a second edition due out later this year. Sadly not in paperback though; can't stand hardbacks so I'll probably give it a miss.
>> since the sender already has an upper bound for the min-RTT, why is the initial congestion window set to a fixed number rather than to "the number of segments that can be transmitted in the RTT"
> Recall that the congestion window is there to limit congestion: it should decrease as congestion increases. With typical queueing techniques, the RTT increases with congestion, so what you are suggesting has the opposite of the desired dynamics.
Sorry, I should have said "the number of segments that can be transmitted in the *minimum* RTT", and then only as the *initial* cwnd. The thinking being that it can't possibly have received an ACK yet, so the fact that it hasn't need not imply congestion. I haven't really thought through the implications of that in the case that the 3-way handshake is made under highly congested conditions though, giving a vastly inaccurate bound for the min-RTT
>I wasn't there (so I'm probably wrong), but I believe that slow-start was designed as a fairly naive mechanism because it was not supposed to matter much in practice. TCP connections were supposed to be either long-lived bulk transfers (FTP, say), or interactive flows
This is interesting, from the point of view of how we're predominantly using a protocol for something a little out of its design parameters.
(I was going to go off on a tangent here about using TCP/IP in circumstances which break its design assumptions, like bufferbloat and highly asymmetrical connections, but I need to think about it some more)
Posted Aug 17, 2011 16:20 UTC (Wed) by butlerm (subscriber, #13312)
It is worth keeping in mind that slow start is not very slow - it is a doubling of the congestion window (and hence average transmit bandwidth) every round trip time. If you don't have something like slow start a new connection tends to immediately saturate every bottleneck link, causing large scale packet loss not only on the new connection, but all the others using the link as well.
That puts all the (congestion controlled) flows on the link into some sort of recovery mode, which is generally much slower than slow start is in the first place - a constant increase every RTT rather than a multiplicative one.
It works, the flows do sort themselves out, but it isn't very friendly, and usually doesn't even help the new connection. That is why they adopted "slow" start in the first place. It replaced previous practice of saturating the outbound link until some sort of loss indication was received. Running a gigabit per second flow into a ten megabit per second link doesn't work all that well.
Posted Aug 15, 2011 11:43 UTC (Mon) by jch (guest, #51929) (3 responses)
This doesn't decrease latency -- it increases throughput for short-lived connections ("mice"). Quite the opposite, in underprovisioned networks with a lot of mice it could increase latency dramatically.
--jch
Posted Aug 15, 2011 13:45 UTC (Mon) by corbet (editor, #1) (2 responses)
You're talking about the congestion window change? That's very much about latency. It lets pages load more quickly without the need to open lots of independent connections; the associated documentation is very clear on the motivation.
Posted Aug 16, 2011 21:45 UTC (Tue) by jch (guest, #51929) (1 responses)
That's just the kind of tricky trade-off that the bufferbloat project is struggling with.
--jch
Posted Aug 17, 2011 16:02 UTC (Wed) by butlerm (subscriber, #13312)
There seem to me to be only two good ways to solve the queuing latency problem, beyond simply reducing queuing limits on bottleneck routers and interfaces to reasonable sizes. One is the widespread deployment of packet pacing, which is difficult to do well without hardware support, and which has other challenges. The other is fair (flow-specific) queuing at every bottleneck router or interface. The latter seems much more practical to me.