LWN: Comments on "Network transmit queue limits"
https://lwn.net/Articles/454390/
This is a special feed containing comments posted to the individual LWN article titled "Network transmit queue limits".

Network transmit queue limits
Posted Mon, 12 Aug 2013 04:42:00 +0000 by shentino (https://lwn.net/Articles/563046/)

The purpose of the device queue is actually to maximize throughput by keeping the interface busy without having to pester the kernel for new packets - especially if the kernel is busy with something else and can't immediately handle an interrupt.

And since the queue is in many cases being digested by the hardware itself, the kernel can't just tinker with it willy-nilly.

Network transmit queue limits
Posted Wed, 24 Aug 2011 15:03:24 +0000 by wtanksleyjr (https://lwn.net/Articles/456214/)

It seems to me - ignorance alert! - that the problem isn't the bytes or the time at all; it's the variance. The purpose of a queue isn't to make a device send faster or slower; it's to cover up variance.

The sources of the variance will have to be considered carefully; variance caused by time delays on the output is probably different from variance caused by multiple clients asynchronously loading data into the input.

Network transmit queue limits
Posted Wed, 17 Aug 2011 16:20:29 +0000 by butlerm (https://lwn.net/Articles/455299/)

> I wasn't there (so I'm probably wrong), but I believe that slow-start was designed as a fairly naive mechanism because it was not supposed to matter much in practice

It is worth keeping in mind that slow start is not very slow - it doubles the congestion window (and hence the average transmit bandwidth) every round-trip time. Without something like slow start, a new connection tends to immediately saturate every bottleneck link, causing large-scale packet loss not only on the new connection, but on all the others using the link as well.

That puts all the (congestion-controlled) flows on the link into some sort of recovery mode, which is generally much slower than slow start in the first place - a constant increase every RTT rather than a multiplicative one.

It works, the flows do sort themselves out, but it isn't very friendly, and it usually doesn't even help the new connection. That is why "slow" start was adopted in the first place: it replaced the previous practice of saturating the outbound link until some sort of loss indication was received. Running a gigabit-per-second flow into a ten-megabit-per-second link doesn't work all that well.
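To make those two growth rates concrete, here is a toy model in C of the window dynamics described above - made-up numbers, no losses or ACK clocking, so a sketch of the idea rather than real TCP:

    #include <stdio.h>

    /* Toy model of congestion window growth, in segments: multiplicative
     * doubling during slow start vs. one-segment additive increase once
     * the slow-start threshold is crossed. Numbers are illustrative. */
    int main(void)
    {
        int cwnd = 1;             /* congestion window, in segments */
        const int ssthresh = 64;  /* slow-start threshold */

        for (int rtt = 0; rtt <= 9; rtt++) {
            printf("RTT %2d: cwnd = %3d segments\n", rtt, cwnd);
            if (cwnd < ssthresh)
                cwnd *= 2;        /* slow start: doubles each round trip */
            else
                cwnd += 1;        /* additive increase: +1 segment per RTT */
        }
        return 0;
    }

Slow start reaches 64 segments in six round trips; the additive mode would need 63 round trips to cover the same ground, which is the point about recovery being much slower than slow start.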
Network transmit queue limits
Posted Wed, 17 Aug 2011 16:02:20 +0000 by butlerm (https://lwn.net/Articles/455296/)

I wouldn't worry too much about an initial congestion window of ten packets. On a five Mb/s bottleneck link with 1500-byte packets, that is only about 2.4 ms of queuing delay. The queuing delay due to ack compression as the congestion window increases is probably going to be considerably higher than that.

There seem to me to be only two good ways to solve the queuing latency problem, beyond simply reducing queuing limits on bottleneck routers and interfaces to reasonable sizes. One is the widespread deployment of packet pacing, which is difficult to do well without hardware support and which has other challenges. The other is fair (flow-specific) queuing at every bottleneck router or interface. The latter seems much more practical to me.
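For scale, the arithmetic behind that figure: serializing one 1500-byte packet on a 5 Mb/s link takes 1500 × 8 / 5,000,000 s ≈ 2.4 ms, so 2.4 ms is the per-packet cost; a back-to-back burst of ten such packets occupies the link for roughly 24 ms before it fully drains.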
Network transmit queue limits
Posted Wed, 17 Aug 2011 12:05:43 +0000 by nye (https://lwn.net/Articles/455276/)

> I don't have my library handy, but I seem to recall that Tanenbaum discusses TCP congestion control at length. I'm sure you'll find something good in Stevens too.

Thanks for the reference. I don't know Stevens - I assume you're talking about TCP/IP Illustrated? I notice there's a second edition due out later this year. Sadly not in paperback, though; I can't stand hardbacks, so I'll probably give it a miss.

>> since the sender already has an upper bound for the min-RTT, why is the initial congestion window set to a fixed number rather than to "the number of segments that can be transmitted in the RTT"

> Recall that the congestion window is there to limit congestion: it should decrease as congestion increases. With typical queueing techniques, the RTT increases with congestion, so what you are suggesting has the opposite of the desired dynamics.

Sorry, I should have said "the number of segments that can be transmitted in the *minimum* RTT", and then only as the *initial* cwnd. The thinking is that the sender can't possibly have received an ACK yet, so the fact that it hasn't need not imply congestion. I haven't really thought through the implications in the case where the three-way handshake is made under highly congested conditions, though, giving a vastly inaccurate bound for the min-RTT.

> I wasn't there (so I'm probably wrong), but I believe that slow-start was designed as a fairly naive mechanism because it was not supposed to matter much in practice. TCP connections were supposed to be either long-lived bulk transfers (FTP, say), or interactive flows

This is interesting, from the point of view of how we're predominantly using a protocol for something a little out of its design parameters.

(I was going to go off on a tangent here about using TCP/IP in circumstances which break its design assumptions, like bufferbloat and highly asymmetrical connections, but I need to think about it some more.)

Network transmit queue limits
Posted Tue, 16 Aug 2011 22:19:30 +0000 by jch (https://lwn.net/Articles/455245/)

> If anyone knows of any resources which explain this problem from "first principles"

I don't have my library handy, but I seem to recall that Tanenbaum discusses TCP congestion control at length. I'm sure you'll find something good in Stevens too.

> I can't wrap my head around slow-start, probably because I don't think I understand the problem it's intended to solve.

I'll make the bold claim that nobody fully understands the dynamics of TCP.

I wasn't there (so I'm probably wrong), but I believe that slow-start was designed as a fairly naive mechanism because it was not supposed to matter much in practice. TCP connections were supposed to be either long-lived bulk transfers (FTP, say) or interactive flows (telnet, or the conversational phase of SMTP). In the first case, slow-start only happens at the beginning of the transfer, which is a negligible part of the connection, while in the second case the size of the congestion window doesn't matter.

The trouble is with HTTP, which causes a lot of short-lived connections. Such a connection spends most or all of its life in slow-start. Hence the need for sharing state between different connections (which Linux does, AFAIR) or tweaking the initial window.

> since the sender already has an upper bound for the min-RTT, why is the initial congestion window set to a fixed number rather than to "the number of segments that can be transmitted in the RTT"

Recall that the congestion window is there to limit congestion: it should decrease as congestion increases. With typical queueing techniques, the RTT increases with congestion, so what you are suggesting has the opposite of the desired dynamics.

Yeah, it's tricky. No, I don't claim to understand the trade-offs involved.

--jch

Network transmit queue limits
Posted Tue, 16 Aug 2011 21:45:06 +0000 by jch (https://lwn.net/Articles/455240/)

It does cause more packets to be queued, which increases queue length and hence network-layer latency. On the other hand, it does cause packets to be sent faster, which I guess can be described as reducing application-layer latency (the time needed to load a page).

That's just the kind of tricky trade-off that the bufferbloat project is struggling with.

--jch

Network transmit queue limits
Posted Mon, 15 Aug 2011 14:22:03 +0000 by nye (https://lwn.net/Articles/454990/)

(Please excuse the naivety of this question.)

I can't wrap my head around slow-start, probably because I don't think I understand the problem it's intended to solve.

What I'm wondering is: since the sender already has an upper bound for the min-RTT, why is the initial congestion window set to a fixed number rather than to "the number of segments that can be transmitted in the RTT" (or the receiver's advertised window, if smaller)?

I guess this wouldn't work for high-latency congested links, since the initial window is, IIUC, used as the *minimum* window to fall back to when congestion occurs - but why does that need to be the case? I suspect the answer may be along the lines of "that's the point of slow-start", but it's not intuitive to me.

If anyone knows of any resources which explain this problem from "first principles" - i.e. without requiring the reader to already have more than a passing familiarity with TCP - I'd appreciate a pointer.

Network transmit queue limits
Posted Mon, 15 Aug 2011 13:45:11 +0000 by corbet (https://lwn.net/Articles/454986/)

You're talking about the congestion window change? That's very much about latency. It lets pages load more quickly without the need to open lots of independent connections; the associated documentation is very clear on the motivation.
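As a back-of-the-envelope illustration of why the initial window matters for short transfers, consider how many round trips pure slow start needs to deliver a small object. A sketch with made-up numbers, ignoring losses and the receiver window:

    #include <stdio.h>

    /* Toy model: round trips needed to deliver n segments when the
     * congestion window starts at iw and doubles every RTT (pure slow
     * start, no losses, receiver window ignored). */
    static int rtts_to_deliver(int n, int iw)
    {
        int rtts = 0, cwnd = iw, sent = 0;
        while (sent < n) {
            sent += cwnd;   /* one window's worth per round trip */
            cwnd *= 2;      /* slow start doubles the window */
            rtts++;
        }
        return rtts;
    }

    int main(void)
    {
        /* e.g. a ~45 KB page = 30 segments of 1500 bytes */
        printf("IW=3:  %d RTTs\n", rtts_to_deliver(30, 3));   /* 4 */
        printf("IW=10: %d RTTs\n", rtts_to_deliver(30, 10));  /* 2 */
        return 0;
    }

A 30-segment page takes four round trips with an initial window of three segments, but only two with a window of ten - roughly the saving the congestion-window change is after.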
Network transmit queue limits
Posted Mon, 15 Aug 2011 11:43:10 +0000 by jch (https://lwn.net/Articles/454976/)

> So it is not surprising that we have seen various latency-reducing changes from Google, including the increase in the initial congestion window

This doesn't decrease latency - it increases throughput for short-lived connections ("mice"). Quite the opposite: in underprovisioned networks with a lot of mice, it could increase latency dramatically.

--jch

Initial congestion window
Posted Sun, 14 Aug 2011 00:11:00 +0000 by butlerm (https://lwn.net/Articles/454911/)

Sorry for creating any confusion. I see on git.kernel.org (http://git.kernel.org/?p=linux/kernel/git/stable/linux-3.0.y.git;a=history;f=include/net/tcp.h;h=cda30ea354a214072b634ee9c2fa9b7ff23cc216;hb=HEAD) that both patches have made it in, which is good news. However, I believe that the increase to the initial congestion window is still a draft (http://tools.ietf.org/html/draft-hkchu-tcpm-initcwnd-01), not an RFC.

In the wild
Posted Sat, 13 Aug 2011 14:51:05 +0000 by dmarti (https://lwn.net/Articles/454889/)

If you're using Google or Microsoft web sites, you're probably also testing this: "Google and Microsoft Cheat on Slow-Start. Should You?" (http://blog.benstrong.com/2010/11/google-and-microsoft-cheat-on-slow.html)

Initial congestion window
Posted Sat, 13 Aug 2011 14:18:14 +0000 by corbet (https://lwn.net/Articles/454887/)

No, it's the initial congestion window; I'm not quite sure where this comes from. And yes, it went through a long process with the IETF first.

Network transmit queue limits
Posted Sat, 13 Aug 2011 07:40:11 +0000 by butlerm (https://lwn.net/Articles/454872/)

Getting the time accurate to microseconds can be a rather expensive operation, unfortunately, and that weighs against regulating queue lengths in terms of time when a simple proxy like bytes is available.

Network transmit queue limits
Posted Sat, 13 Aug 2011 07:36:43 +0000 by butlerm (https://lwn.net/Articles/454870/)

According to the linked article, the patch which was merged in 2.6.38 increases the initial receive window, not the initial congestion window. A patch increasing the initial congestion window would be the sort of thing the IETF would frown upon - without their blessing, of course.

Network transmit queue limits
Posted Sat, 13 Aug 2011 05:27:26 +0000 by dlang (https://lwn.net/Articles/454861/)

The key thing is that if the delay in transmitting is going to be too long, you want to be able to have the upper layers return an error rather than leaving the data in the queue.
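A minimal sketch of that idea - admission control at enqueue time, so the error surfaces immediately instead of the packet sitting stale in the queue. Everything here (the struct, the field names, the fixed drain-rate estimate) is illustrative, not an existing kernel interface:

    #include <stddef.h>

    struct byteq {
        long bytes_queued;   /* bytes currently sitting in the queue */
        long rate_Bps;       /* estimated drain rate, bytes/second */
        long budget_us;      /* latency budget, microseconds */
    };

    /* Returns 0 if the packet may be queued, -1 if queueing it would
     * push the expected queue delay past the budget; the caller would
     * then report the error upward instead of buffering the data. */
    int byteq_admit(struct byteq *q, size_t pkt_len)
    {
        long drain_us = (q->bytes_queued + (long)pkt_len) * 1000000L
                        / q->rate_Bps;
        if (drain_us > q->budget_us)
            return -1;
        q->bytes_queued += (long)pkt_len;
        return 0;
    }

The weak spot, as the comments above note, is rate_Bps: on shared media there is no reliable drain-rate estimate to plug in.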
Network transmit queue limits
Posted Sat, 13 Aug 2011 05:15:10 +0000 by sfink (https://lwn.net/Articles/454859/)

This may very well be the right solution, but it seems less obvious than the text of this article would imply. Rather than dynamically adjusting the network device queue length, it seems like you'd really want to keep the device queue as short as possible without getting underruns, and feed it with a much larger priority queue of per-connection queues controlled by the kernel - one which is lockless and served by a very high-priority realtime thread.

But I don't know anything about what's involved, so this probably isn't a realistic solution.

Network transmit queue limits
Posted Fri, 12 Aug 2011 17:25:12 +0000 by ajb (https://lwn.net/Articles/454792/)

I was thinking of something along the lines of:

    /* Time-based queue: stamp packets on entry, discard them at the
     * head once they have been queued longer than max_time. */
    void q_add(Q *q, PKT *pkt)
    {
        pkt->time = now();      /* timestamp packet */
        pkt->next = NULL;
        *q->last = pkt;         /* add packet to end of list */
        q->last = &pkt->next;
    }

    PKT *q_get(Q *q)
    {
        PKT *pkt;

        /* drop stale packets instead of transmitting them */
        while ((pkt = q->first) != NULL && pkt->time + q->max_time < now()) {
            q->first = pkt->next;
            if (q->first == NULL)
                q->last = &q->first;    /* queue is now empty */
            free(pkt);
        }
        return pkt;     /* head of queue, or NULL if empty */
    }

No estimation at all. There are weaknesses in this approach, but it's simpler than adjusting a byte length.

Network transmit queue limits
Posted Fri, 12 Aug 2011 16:55:45 +0000 by dlang (https://lwn.net/Articles/454791/)

Time is _much_ harder to estimate and measure than bytes.

If you have a full-duplex connection (i.e. hard-wired Ethernet on modern switches), bytes and time have a very close correlation.

If you are on a shared-media connection (unfortunately including all radio-based systems), then the correlation is not as close, because you can't know ahead of time how long it will take to send the data (you have to wait for other systems, retry, etc.).

I think bytes is as accurate as you are going to be able to get.

Network transmit queue limits
Posted Fri, 12 Aug 2011 12:09:26 +0000 by ajb (https://lwn.net/Articles/454760/)

I wonder if it wouldn't work better to define the queue length in microseconds, rather than bytes. That seems to be what this mechanism is approximating.
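For contrast with the time-stamping sketch above, the byte-based approach the article describes adjusts a limit dynamically. A toy flavor of that idea - a loose illustration of the feedback loop, not the actual algorithm from the patch set - might look like:

    /* Toy dynamic byte limit: grow the limit when the device ran dry
     * while more data was waiting, shrink it slowly otherwise, so the
     * queue stays as short as it can be without starving the device.
     * All names and constants here are made up for illustration. */
    struct dql_toy {
        long limit;     /* current byte limit for the device queue */
        long queued;    /* bytes handed to the device, not yet completed */
        int  starved;   /* set by the driver if it went idle with data pending */
    };

    void dql_toy_completed(struct dql_toy *d, long bytes)
    {
        d->queued -= bytes;
        if (d->starved)
            d->limit += bytes;          /* too small: grow aggressively */
        else if (d->limit > bytes)
            d->limit -= d->limit / 16;  /* slowly reclaim any excess */
        d->starved = 0;
    }

    int dql_toy_may_queue(const struct dql_toy *d)
    {
        return d->queued < d->limit;
    }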