Van Jacobson's network channels
Unfortunately, attending Van Jacobson's talk at linux.conf.au this year did not work out. Fortunately, David Miller was there and listening carefully. Van has figured out how the next round of networking performance improvements will happen, and he has the numbers to prove it. Expect some very interesting (and fundamental) changes in the Linux networking stack as Van's ideas are incorporated. This article attempts to cover the fundamentals of Van's scheme (called "channels"), based on David's weblog entry and Van's slides [PDF].
Van, like many others, points out that the biggest impediment to scalability on contemporary hardware is memory performance. Current processors can often execute multiple instructions per nanosecond, but loading a cache line from memory still takes 50ns or more. So cache behavior will often be the dominant factor in the performance of kernel code. That is why simply making code smaller often makes it faster. The kernel developers understand cache behavior well, and much work has gone into improving cache utilization in the kernel.
The Linux networking stack (like all others) does a number of things which reduce cache performance, however. These include:
- Passing network packets through multiple layers of the kernel. When a
packet arrives, the network card's interrupt handler begins the task
of feeding the packet to the kernel. The remainder of the work may
well be performed at software interrupt level within the driver (in a
tasklet, perhaps). The core network processing happens in another
software interrupt. Copying the data (an expensive operation in
itself) to the application happens in kernel context. Finally the
application itself does something interesting with the data. The
context changes are expensive, and if any of these changes causes the
work to move from one CPU to another, a big cache penalty results.
Much work has been done to improve CPU locality in the networking
subsystem, but much remains to be done.
- Locking is expensive. Taking a lock requires a cross-system atomic
operation and moves a cache line between processors. Locking costs
have led to the development of lock-free techniques like seqlocks and read-copy-update, but the networking stack (like the rest of the kernel) remains full of locks.
- The networking code makes extensive use of queues implemented with doubly-linked lists. These lists have poor cache behavior since they require each user to make changes (and thus move cache lines) in multiple places.
To demonstrate what can happen, Van ran some netperf tests on an instrumented kernel. On a single CPU system, processor utilization was 50%, of which 16% was in the socket code, 5% in the scheduler, and 1% in the application. On a two-processor system, utilization went to 77%, including 24% in the socket code and 12% in the scheduler. That is a worst case scenario in at least one way: the application and the interrupt handler were configured to run on different CPUs. Things will not always be that bad in the real world, but, as the number of processors increases, the chances of the interrupt handler running on the same processor as any given application decrease.
The key to better networking scalability, says Van, is to get rid of locking and shared data as much as possible, and to make sure that as much processing work as possible is done on the CPU where the application is running. It is, he says, simply the end-to-end principle in action yet again. This principle, which says that all of the intelligence in the network belongs at the ends of the connections, doesn't stop at the kernel. It should continue, pushing as much work as possible out of the core kernel and toward the actual applications.
The tool used to make this shift happen is the "net channel," intended to be a replacement for the socket buffers and queues used in the kernel now. Some details of how channels are implemented can be found in Van's slides, but all that really matters is the core concept: a channel is a carefully designed circular buffer. Properly done, circular buffers require no locks and share no writable cache lines between the producer and the consumer. So adding data to (or removing data from) a net channel will be a fast, cache-friendly operation.
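To make the idea concrete, here is a minimal, hypothetical sketch (plain C11, with invented names; not Van's actual data structure or the kernel's implementation) of such a single-producer, single-consumer ring. The producer writes only the tail index and the consumer writes only the head index, so no lock is needed and neither side's control state lives in a cache line the other side writes.

    /* Sketch of a lock-free SPSC "channel": producer owns 'tail',
     * consumer owns 'head'; the two indices are kept on separate
     * cache lines so neither side dirties the other's line. */
    #include <stdatomic.h>
    #include <stddef.h>

    #define CHAN_SLOTS 256                          /* power of two */

    struct net_channel {
        _Atomic size_t tail __attribute__((aligned(64)));  /* producer-owned */
        _Atomic size_t head __attribute__((aligned(64)));  /* consumer-owned */
        void *slot[CHAN_SLOTS];                             /* packet buffers */
    };

    /* Producer side (e.g. the driver's interrupt handler). */
    static int chan_put(struct net_channel *ch, void *pkt)
    {
        size_t t = atomic_load_explicit(&ch->tail, memory_order_relaxed);
        size_t h = atomic_load_explicit(&ch->head, memory_order_acquire);

        if (t - h == CHAN_SLOTS)
            return -1;                              /* full: drop the packet */
        ch->slot[t % CHAN_SLOTS] = pkt;
        atomic_store_explicit(&ch->tail, t + 1, memory_order_release);
        return 0;
    }

    /* Consumer side (e.g. the socket code or the application). */
    static void *chan_get(struct net_channel *ch)
    {
        size_t h = atomic_load_explicit(&ch->head, memory_order_relaxed);
        size_t t = atomic_load_explicit(&ch->tail, memory_order_acquire);

        if (h == t)
            return NULL;                            /* empty */
        void *pkt = ch->slot[h % CHAN_SLOTS];
        atomic_store_explicit(&ch->head, h + 1, memory_order_release);
        return pkt;
    }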
As a first step, channels can be pushed into the driver interface. A network driver need no longer be aware of sk_buff structures and such; instead, it simply drops incoming packets into a channel as they are received. Making this change cuts the CPU utilization in the two-processor case back to 58%. But things need not stop there. A next logical step would be to get rid of the networking stack processing at softirq level and to feed packets directly into the socket code via a channel. Doing that requires creating a separate channel for each socket and adding a simple packet classifier so that the driver knows which channel should get each packet. The socket code must also be rewritten to do the protocol processing (using the existing kernel code). That change drops the overall CPU utilization to 28%, with the portion spent at softirq level dropping to zero.
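The classifier itself can be small. The following is a hypothetical sketch (invented names and layout, not code from the actual patches) of the kind of lookup a driver might do: hash the packet's protocol, addresses, and ports into a flow table and hand the packet to the owning socket's channel, falling back to a default channel for anything unclaimed.

    /* Sketch of a per-socket packet classifier for received packets. */
    #include <stdint.h>
    #include <stddef.h>

    struct net_channel;                 /* the ring sketched above */

    struct flow_key {
        uint32_t saddr, daddr;          /* IPv4 source/destination address */
        uint16_t sport, dport;          /* TCP/UDP source/destination port */
        uint8_t  proto;                 /* IPPROTO_TCP, IPPROTO_UDP, ...   */
    };

    #define FLOW_BUCKETS 1024

    struct flow_entry {
        struct flow_key     key;
        struct net_channel *chan;       /* channel feeding the owning socket */
        struct flow_entry  *next;
    };

    static struct flow_entry  *flow_table[FLOW_BUCKETS];
    static struct net_channel *default_chan;   /* legacy stack / softirq path */

    static uint32_t flow_hash(const struct flow_key *k)
    {
        uint32_t h = k->saddr ^ k->daddr ^ k->proto;
        h ^= ((uint32_t)k->sport << 16) | k->dport;
        h *= 0x9e3779b1u;               /* cheap mixing step */
        return h % FLOW_BUCKETS;
    }

    /* Called from the driver for each received packet. */
    static struct net_channel *classify(const struct flow_key *k)
    {
        for (struct flow_entry *e = flow_table[flow_hash(k)]; e; e = e->next)
            if (e->key.saddr == k->saddr && e->key.daddr == k->daddr &&
                e->key.sport == k->sport && e->key.dport == k->dport &&
                e->key.proto == k->proto)
                return e->chan;
        return default_chan;            /* no socket channel: punt to the stack */
    }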
But why stop there? If one wants to be serious about this end-to-end thing, one could connect the channel directly to the application. Said application gets the packet buffers mapped directly into its address space and performs protocol processing by way of a user-space library. This would be a huge change in how Linux does networking, but Van's results speak for themselves. Here is his table showing the percentage CPU utilization for each of the cases described above:
    Configuration     Total CPU  Interrupt  SoftIRQ  Socket  Locks  Sched  App.
    1 CPU                    50          7       11      16      8      5     1
    2 CPUs                   77          9       13      24     14     12     1
    Driver channel           58          6       12      16      9      9     1
    Socket channel           28          6        0      16      1      3     1
    App. channel             14          6        0       0      0      2     5
The bottom line (literally) is this: processing time for the packet stream dropped to just over 25% of the previous single-CPU case, and to less than 20% of the previous two-CPU behavior. Three layers of kernel code have been shorted out altogether, with the remaining work performed in the driver interrupt handler and the application itself. The test system running with the full application channel code was able to handle twice the network bandwidth of an unmodified system - with the processors idle most of the time.
Linux networking hackers have always been highly attentive to performance
issues, so numbers like these are bound to get their attention. Beyond
performance, however, this approach promises simpler drivers and a
reasonably straightforward transition between the current stack and a
future stack built around channels. A channel-based user-space interface
will make it easy to create applications which can send and receive packets
using any protocol. If Van's results hold together in a "real-world"
implementation, the only remaining question would be: when will it be
merged so the rest of us can use it?
Index entries for this article: Kernel: Networking/Channels
Posted Jan 31, 2006 20:43 UTC (Tue) by imcdnzl (guest, #28899)
Posted Jan 31, 2006 21:23 UTC (Tue) by job (guest, #670)
Posted Jan 31, 2006 21:28 UTC (Tue) by cloose (guest, #5066)
Thank you for yet another enlightening article which explains foreign concepts so even mortals have at least a remote chance of understanding them! I totally agree. The just-renewed subscription paid off again. Thank you for this great article!
Posted Feb 1, 2006 15:19 UTC (Wed) by Baylink (guest, #755)
Do more. :-)
Posted Feb 9, 2006 9:05 UTC (Thu) by burki99 (subscriber, #17149)
Posted Jan 31, 2006 22:02 UTC (Tue) by ernest (guest, #2355)
Posted Jan 31, 2006 22:29 UTC (Tue) by csamuel (✭ supporter ✭, #2624)
His comment was that there was no reason why netfilter couldn't become just another consumer of packets. My take on that is that, whilst it would require patching the netfilter code, that too could be a good thing if it eliminates the use of doubly-linked lists (the use of which would cause you to fail CS 101 under VJ, according to him :-)), but it would probably be a lot of code.
Of course, this is only necessary if you go further than channelising the drivers themselves. One of the really elegant things about this change of thinking is that it's very modular - you can convert drivers one at a time until they're all done, then look at channelising the socket layer, and then start on the consumers of the socket layer.
There are two nice things about having the TCP stack running in user space: one is that it allows you to easily experiment with and debug TCP issues and have custom behaviour for different applications based on need; the second is that VJ explained the only reason it had to go into the kernel in Multics in the first place was that if a user process got paged out there it could take two minutes to get paged back in, which TCP/IP doesn't like. :-)
Chris
Posted Feb 1, 2006 11:00 UTC (Wed) by james (subscriber, #1325)
That looks like it's enough for most firewalling: it should give you pass (existing channel), fail (no channel), or needs more work (channel to netfilter).
Posted Feb 1, 2006 20:00 UTC (Wed) by NAR (subscriber, #1313)
Posted Feb 2, 2006 5:45 UTC (Thu) by xoddam (guest, #2322)
Posted Feb 2, 2006 9:57 UTC (Thu) by NAR (subscriber, #1313)
Posted Feb 2, 2006 23:56 UTC (Thu) by xoddam (guest, #2322)
Posted Feb 2, 2006 21:46 UTC (Thu) by iabervon (subscriber, #722)
Of course, the kernel would have to keep a TCP implementation, but that's not surprising, since static binaries that use sockets should continue to work.
Posted Feb 2, 2006 12:29 UTC (Thu) by samj (guest, #7135)
Posted Feb 2, 2006 20:12 UTC (Thu) by jonabbey (guest, #2736)
Posted Feb 2, 2006 20:53 UTC (Thu) by caitlinbestler (guest, #32532)
Posted Jan 31, 2006 22:36 UTC (Tue) by xav (guest, #18536)
Posted Jan 31, 2006 22:49 UTC (Tue) by csamuel (✭ supporter ✭, #2624)
0) VJ said that the talk was *not* about fixing the Linux TCP stack, as "the Linux TCP stack isn't broken" - but just because something has always been done this way (SKBs) doesn't mean it is necessarily the best way.
1) VJ was asked about this code going into the kernel - his reply was that he would slap the GPL onto his new code for the drivers and the socket layer on the plane home, but that the user-layer TCP side may be a bit more difficult as he needs to get agreement from others.
2) The entire user-level TCP stack was done as an LD_PRELOAD'ed library, and hence no actual changes to applications are necessary, so people can experiment to their hearts' content with tuning TCP application by application. Fancy an Apache with a different congestion control method to your OpenSSH clients and server? Go for it. (A small illustration of the LD_PRELOAD mechanism follows this list.)
3) This reduced the amount of code in the interrupt handler of the e1000 considerably (from ~700 lines down to ~300) and removes all SKB code, hence simplifying drivers, which can only be a good thing.
4) The new channelised napi_poll() routine is generic, rather than the current device-dependent implementations.
All in all, an excellent talk.
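For readers unfamiliar with the mechanism mentioned in point 2, here is a tiny, hypothetical illustration of how an LD_PRELOAD library interposes on socket calls. It is not Van's library, just the standard interposition technique it relies on; a real channel library would run its TCP engine where this stub merely logs and passes through.

    /* shim.c - build: gcc -shared -fPIC -o libtcpshim.so shim.c -ldl
     * run:   LD_PRELOAD=./libtcpshim.so ./some_app                   */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    static ssize_t (*real_send)(int, const void *, size_t, int);

    ssize_t send(int fd, const void *buf, size_t len, int flags)
    {
        if (!real_send)         /* look up libc's send() the first time */
            real_send = (ssize_t (*)(int, const void *, size_t, int))
                            dlsym(RTLD_NEXT, "send");

        /* A channel-based library would check whether fd is backed by a
         * user-space channel and, if so, do its protocol processing here
         * instead of entering the kernel.  This stub only logs. */
        fprintf(stderr, "send(fd=%d, %zu bytes)\n", fd, len);
        return real_send(fd, buf, len, flags);
    }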
Posted Feb 1, 2006 0:57 UTC (Wed) by nix (subscriber, #2304)
Posted Feb 7, 2006 18:15 UTC (Tue) by arafel (guest, #18557)
Posted Jan 31, 2006 22:54 UTC (Tue) by csamuel (✭ supporter ✭, #2624)
1) His test results for 10Gb/s ethernet were limited not by drivers, kernel or networking but by the memory bandwidth of the DDR333 chips in the system he was testing on!
He estimated that you would need at least DDR800 RAM to be able to have enough memory bandwidth to drive 10Gb/s at capacity.
2) VJ said anyone who tells you that you'll need a TOE is not telling the truth. OK - this was known already but could be handy for beating vendors around the head with.. :-)
Posted Feb 1, 2006 2:17 UTC (Wed) by bk (guest, #25617)
Posted Feb 1, 2006 2:25 UTC (Wed) by corbet (editor, #1)
Posted Feb 2, 2006 5:35 UTC (Thu) by bos (guest, #6154)
Not having been at the talk, I don't know what circumstances he was talking about (perhaps specifically TCP at 10Gbps?), so I'm not picking a nit with his assertion, just pointing out that something along those lines can be done without scads of memory bandwidth.
Posted Feb 8, 2006 8:57 UTC (Wed) by csamuel (✭ supporter ✭, #2624)
Posted Jan 31, 2006 23:27 UTC (Tue) by dwheeler (guest, #1216)
But it appears to me that this has a dark side. Today, because the kernel assembles packets, only trusted (root) programs can forge packets and create many kinds of funky attack packets. If user-level applications can create arbitrary packets, then ANYONE -- even untrusted applications -- can forge arbitrary packets and arbitrary attack packets.
Clearly, in some situations this wouldn't matter. But historically getting only "user privileges" limited what you could do, including having to give away your IP address and only being able to send certain kinds of packets. This gives a new weapon, not so much against the machine IMPLEMENTING the new approach, but against OTHER machines (whether or not they do so). Today, given only low privileges, you can't create funky packets (like Xmas tree ones) or total forgeries. Unless there are kernel-level checks or I misunderstand something, you CAN cause these problems.
Just imagine; a user-space app sends out a broadcast from 127.0.0.1, etc., etc. There's a LOT of mischief that's been limited to kernel-space programs before that this might expose.
I'd like to see an implementation automatically check the outgoing packets for certain properties as part of the kernel (e.g., valid sender IP and port address, etc.). But I fear that won't happen by default, because (1) that would take extra time, and (2) it only affects OTHER people. And I understand (though don't agree with) the other side of the coin: Yes, of course people who have root can send any packet. That's not my point. My point is that for a large, shared network to be useful, there needs to be a defense-in-depth so that attackers aren't automatically given the whole store when they get just a little privilege. This would kick away one of those mechanisms.
Eek.
Posted Feb 1, 2006 2:20 UTC (Wed) by elanthis (guest, #6227)
Absolutely nothing stops a user from booting their workstation with a LiveCD that they have root access to. Or plugging in a different machine to the network. Or rebooting into single-user mode.
You cannot rely on a per-machine control like root access to protect your network. If you want to do that, you have to have some sort of encryption/signing on every network packet sent and physically lock down the end-user workstations so that they can't reboot into single-user mode or pop in a LiveCD or modify/replace the hard disk.
Posted Feb 1, 2006 2:34 UTC (Wed) by Ross (guest, #4065)
If it could, that's what I'd call a gaping security hole.
Posted Feb 1, 2006 3:57 UTC (Wed) by xoddam (guest, #2322)
Posted Feb 1, 2006 6:38 UTC (Wed) by cventers (guest, #31465)
Posted Feb 3, 2006 4:15 UTC (Fri) by zblaxell (subscriber, #26385)
Posted Feb 3, 2006 5:48 UTC (Fri) by xoddam (guest, #2322)
Posted Feb 1, 2006 12:15 UTC (Wed) by smitty_one_each (subscriber, #28989)
Posted Feb 1, 2006 13:54 UTC (Wed) by Ross (guest, #4065)
Posted Feb 2, 2006 4:05 UTC (Thu) by elanthis (guest, #6227)
If you are implicitly trusting every packet sent by some 'trusted' host (which, if it were truly trusted, would never be running any malicious code anyhow), or trusting anything running on port 1024 down, you're not running a very secure network at all.
There is no security at the IP level at all. If you want trust and security, you have to put it all in higher layers.
Posted Feb 2, 2006 5:28 UTC (Thu) by Ross (guest, #4065)
Posted Feb 1, 2006 16:35 UTC (Wed) by dwheeler (guest, #1216)
That's not the point. Not all users are malicious. Sometimes they run programs that APPEAR to do one thing, but do another. Sometimes systems run servers (like web servers) that an attacker can somehow subvert. In THOSE cases, I'd like the system to still limit what the attacker can do, INCLUDING limits on how the attacker can attack other systems.
Sure, all systems should be invulnerable to all attackers. But they aren't. Anyone who's managed a big network knows how hard it is to keep EVERYTHING secure, ESPECIALLY since there are some vendors who do not release patches for KNOWN vulnerabilities (names withheld, but Google can help you find them rather quickly). So you really need defense-in-depth: you need to try to make it so that attackers have to break down MULTIPLE barriers to get the goods.
Limiting the network-level actions of unprivileged accounts is not the be-all of security. But it's one of the few mechanisms we CURRENTLY have deployed widely that slow the spread of attacks across a network. Diseases that spread rapidly are often unstoppable, because you just don't have enough time to react. Slowing the spread of a disease is key to countering it. Similarly, in the network world, slowing down attack vectors is also key to countering it.
I'd like to see that packets from untrusted user apps are still FORCED to obey certain limits on what they can send.
You don't need a system-wide lock to do that kind of checking; after a call to the kernel, the memory could be mapped out and checked WITHOUT harming the cache lines of other systems.
For most systems it'd just involve checking a few bytes... nothing expensive, and certainly taking less time than sending something down any network port.
Posted Feb 1, 2006 20:08 UTC (Wed) by NAR (subscriber, #1313)
Except the fact that this user can be an unauthorized one who's just cracked into the system from another continent using the latest bug in a PHP BBS, and his processes are running as the 'nobody' user. He'd have a hard time putting a live CD into the computer, but we still really don't want him to send arbitrary packets into the network.
Posted Feb 8, 2006 13:14 UTC (Wed) by jzbiciak (guest, #5246)
That way, the TCP/IP implementation can be stored away in a fixed implementation that root checks in on (and the kernel may even checksum at launch time), but the processing still lives in userspace. It looks a little like the priv-sep that sshd uses.
Granted, with two cooperating threads, you get back to some of the context switching issues, but still it feels a little more flexible than keeping it in kernel space.
Posted Feb 11, 2006 9:33 UTC (Sat) by efexis (guest, #26355)
You obviously don't see this practice taking off, but even so, I think if you told someone who offers shared hosting not to bother protecting against ways unprivileged users can cause havoc, because somebody "can simply break into the datacenter with a boot CD or laptop"... you'd probably get an incredibly sarcastic response.
But no, you stick to talking about how to improve the performance of a heavily loaded 10Gig network interface --on a workstation-- where it'll really count.
Posted Feb 2, 2006 4:11 UTC (Thu) by jamesh (guest, #1159)
As for sending, all that is necessary is a packet verifier that makes sure the packet is appropriate for the given socket (which should be a lot simpler than a full TCP send implementation). If the packet shouldn't be getting sent from the socket, the kernel doesn't transmit it.
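As a hypothetical sketch of what such a verifier might check (names and structures invented for illustration, not code from the talk), the kernel only has to compare the headers of the outgoing buffer against the flow the socket actually owns:

    /* Sketch of an outbound packet check for a user-space channel. */
    #include <stdbool.h>
    #include <stdint.h>

    struct sock_flow {                  /* flow the socket is bound to */
        uint32_t local_addr, remote_addr;
        uint16_t local_port, remote_port;
        uint8_t  proto;
    };

    struct pkt_hdrs {                   /* parsed from the outgoing buffer */
        uint32_t saddr, daddr;
        uint16_t sport, dport;
        uint8_t  proto;
    };

    static bool tx_packet_allowed(const struct sock_flow *owner,
                                  const struct pkt_hdrs *p)
    {
        return p->proto == owner->proto        &&
               p->saddr == owner->local_addr   &&  /* no spoofed source     */
               p->daddr == owner->remote_addr  &&
               p->sport == owner->local_port   &&
               p->dport == owner->remote_port;     /* stays on this flow    */
    }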
Posted Feb 3, 2006 18:57 UTC (Fri) by caitlinbestler (guest, #32532)
So the privileged end of the channel can validate
Posted Feb 1, 2006 0:00 UTC (Wed) by busterb (subscriber, #560)
Posted Feb 1, 2006 9:51 UTC (Wed) by etroup (guest, #21786)
Posted Feb 1, 2006 10:19 UTC (Wed) by ctg (guest, #3459)
Posted Feb 1, 2006 15:51 UTC (Wed) by macc (guest, #510)
could you elaborate on the differences?
Posted Feb 1, 2006 16:39 UTC (Wed) by vmole (guest, #111)
Posted Feb 2, 2006 4:53 UTC (Thu) by bcd (guest, #11759)
These channels are a totally different concept, though, and address a different problem entirely.
Posted Feb 1, 2006 15:41 UTC (Wed) by smoogen (subscriber, #97)
[I am sorry for the hurd joke.. it was a very long night and I found it funny.]
Posted Feb 1, 2006 20:12 UTC (Wed) by NAR (subscriber, #1313)
Posted Feb 1, 2006 20:48 UTC (Wed) by JoeBuck (subscriber, #2330)
Posted Feb 2, 2006 2:13 UTC (Thu) by jwb (guest, #15467)
Posted Feb 2, 2006 4:11 UTC (Thu) by ttfkam (guest, #29791)
Just as Ingo Molnar's work in migrating the scheduler to an O(1) algorithm was a taste of more extensive implementation of O(1) algorithms in the kernel, perhaps elements of the exokernel will find their day in various parts of Linux.
Posted Feb 2, 2006 11:55 UTC (Thu) by csamuel (✭ supporter ✭, #2624)
VJ said that the only reason that TCP/IP was done in the kernel in Multics in the first place was because that was the only place you could be guaranteed not to get paged out for two minutes at a stretch.
Posted Feb 4, 2006 19:44 UTC (Sat) by kbob (guest, #1770)
Posted Feb 5, 2006 1:47 UTC (Sun) by lutchann (subscriber, #8872)
Posted Feb 2, 2006 7:11 UTC (Thu) by jiri.hlusi (guest, #34016)
-- Jiri
Posted Feb 3, 2006 19:00 UTC (Fri) by caitlinbestler (guest, #32532)
Posted Feb 2, 2006 13:04 UTC (Thu) by simlo (guest, #10866)
This issue is not so much of a problem if you run the handling in a thread (ksoftirqd) which will get lower priority as it starts to eat a lot of CPU. That way packets are dropped, but the rest of the system can run.
Posted Feb 3, 2006 0:34 UTC (Fri) by xoddam (guest, #2322)
Posted Feb 8, 2006 9:02 UTC (Wed) by csamuel (✭ supporter ✭, #2624)
Small correction
A small correction to your story (as I was fortunate enough to be there) - the 50% is 100% of 1 CPU on a 2 CPU system - he was showing the results of the network card driver and the stack being bound to the same CPU.
Thank you for yet another enlightening article which explains foreign concepts so even mortals have at least a remote chance of understanding them! The concept sounds very interesting and I really look forward to seeing it in code.
I concur on both points
And, just to hammer the point home, Jon: you're at your best (and most compensable) when you're doing original journalism. We pay you for your opinions, and your clarifications (like this one) of complicated topics.
After reading this article I immediately thought: Wow, maybe I should finally subscribe after years of reading LWN a week delayed. This is the kind of article that differentiates this publication from the hundreds of other publications that just grab a quote from LKML (Linus says no to GPL3) and add no insight at all.
Interesting, but where do things like iptables and other network security fall in if the kernel doesn't do anything anymore with network packets? Up to now that part fell into the socket interface (I think). I can't believe that the last remaining 0% CPU still contains iptables handling.
Can IP security be delegated to userspace?
ernest.
Van Jacobson's network channels and Netfilter
I was fortunate enough to be both at the original presentation and when he repeated it for the "best of" stream at the end of LCA2006 and got to ask him about what the situation was with netfilter.
Presumably you can do a lot of security when you set up the channels. It looks like the packet classifier "reads the protocol, ports, and addresses to determine the flow ID and uses this to find a channel" (Dave Miller's blog).
I'm not sure I fully understand this, but it seems that these channels are used when there is a socket to the user space, i.e. an application running on the host sends/receives data to/from the network. But what about the case when there's no application? As far as I know, in routers the IP packets usually don't get to user space, but if protocol processing is moved to user space (netfilter), it might degrade performance, mightn't it?
The phased implementation described only moves packet processing to userspace at the very last stage. At the 'ends' of the network this is appropriate for efficiency. But even before that stage, channels are a better way to pass packets around within the kernel. The task-oriented interface (using wakeups instead of soft interrupts) would probably mean netfilter no longer runs in tasklet context. We might instead see several netfilter kernelspace daemons (like kswapd and friends), one for each CPU.
Wouldn't it lead to code duplication? For example, a box doing NAT would need a (limited?) TCP/IP implementation in kernel space, while the host running e.g. an FTP client would need the full TCP/IP implementation in user space.
> Wouldn't it lead to code duplication?
Yes. So does inlining :-)
I don't see any reason that all of the channels would have to go to userspace. If a packet is to a kernel NFS client, it would end up in the kernel code, but without all the copies between the network and the VFS.
If it doesn't cost anything, why not? You'd just plug netfilter in before the app and map packet buffers into its address space first. This can all be done in a separate security context too. Looks to me like it would mean a lot less in the way of protocol-specific handling and would allow you to chain such tasks easily (eg netfilter->ipsec or netfilter->reverse proxy->web server etc.).
Please, someone tell me they're not re-inventing STREAMS.
Connections can be channelized *after* they have passed netfilter inspection.
nitpicking
s/want to 77%/went to 77%/
A couple of other quick comments from my notes of the two instances of this talk done at LCA 2006:
Obviously in a full-userspace-TCP/IP implementation, the default implementation would go into something on a par with glibc. Giving everything that wants to do networking an LD_PRELOAD would hammer performance and give Ulrich Drepper an aneurysm ;)
Well, y'know, I'm sure Ulrich's not got enough to do these days... ')
Van Jacobson's network channels & 10 Gb/s ethernet
Two final comments - honest!
What is a TOE?
TOE = "TCP Offload Engine," the TCP protocol implemented in adapter firmware. See this Kernel Page article from last August for one Linux-based implementation and all the reasons why it didn't get merged.
TOE
Plenty of Linux networking gear can drive 10Gbps hardware at line rate, and I'm not even talking about fancy TOE hardware in all cases.
Yes, this is about TCP, not just pushing datagrams out.
This looks VERY interesting, and I expect that this WILL be implemented. I think this is (mostly) a very good idea.
VERY interesting - but security implications to others?!?
That whole "only root can do stuff to the network" reasoning is complete bunk.VERY interesting - but security implications to others?!?
That's not the only meaning of that statement
I sure hope that no application which I run as a normal user is able to reboot the system into another operating system in order to use raw sockets and low port numbers.
I can imagine an internal 'firewall' inspecting the header of each packet traversing a channel from userspace to ensure the app has sufficient privilege to send it. A pipeline stage with negligible performance impact -- it wouldn't thrash the cache, and if it's in the kernel it would involve no extra context switches.
protocol validity checks
Yeah, since you're writing into mapped memory, the kernel can check it out in place. And since there's no copy, it's going to be hanging out in the cache when the check has to take place.
kernel can check it out in place... while the user, maybe on another CPU, switches a few bits just after the kernel check but before the network card picks up the data.
Sneaky indeed!
Ok, freely mapped memory doesn't cut it then. I wonder what the performance impact of changing packet buffers' page permissions would be, relative to copying (and relative to keeping the TCP implementation in kernel space)?
I think parent's point was about a living user doing a cold boot with a live CD, and then committing mischief, not about an existing application you're running somehow warm booting and doing that.
Clearly, unprotected ON/OFF switches and promiscuous BIOS boot settings can be a gaping security hole.
Yes, and my point is that what is equivalent to physical security is not the totality of the problem. Of course, if someone has physical access to your network they can put any packets they like on it. But unprivileged processes running on your server don't have physical access, yet in this scenario they would have the same level of access.
And my point remains... what is that unprivileged process going to do that you couldn't do by plugging in a laptop or some other device onto the network?
If your only point is that security shouldn't depend on the network not being compromised, I agree. However, malicious users with unfettered physical access are not at all equivalent to malicious processes running under unprivileged IDs, and anything which makes them equivalent is decreasing security. Does it matter for well-designed programs? No. But unfortunately tons of commonly used software is not well designed. If you can't trust IPs, port numbers, etc., many things break down. If you can't trust a program a user downloaded you should worry, but your network is not automatically compromised unless there is something which can be exploited on the system.
Sure, if a user is malicious, they can boot the system into some OS where they're fully privileged and attack. Or just unplug the network, plug in their own laptop where they have all privileges, and attack.
Users don't always WANT to attack
Absolutely nothing stops a user from booting their workstation with a LiveCD that they have root access to.
VERY interesting - but security implications to others?!?
I wonder if you can still get most of the benefits of network channels if you limit their accessibility to special user IDs, and then require non-privileged applications to use cooperating threads--one privileged, one not--to send packets.
If my memory serves me correctly, there are one or two servers out there on the internet that run Linux. I know it's not very common, but some of them actually rent space/bandwidth/etc so that people can host their own websites, and they allow these users to run code (cgi scripts) or even have shell access via telnet/ssh.
There isn't much problem with this model on the receive end, assuming that the kernel correctly classifies packets and sends them to the right process.
The same filter rules that route inbound packets can be used to validate outbound packets. You simply do not accept packets from a channel if the response packet would not be routed to the matching channel.
that every packet on it is for a TCP connection that is actually assigned to that channel.
Arsenic was an earlier implementation of a similar idea with apparently good results:
http://www.cl.cam.ac.uk/Research/SRG/netos/arsenic/
So is anyone else reminded of System V Streams?
reinvention?
No. Streams are almost antithetical to Channels.
I had the same feeling of similarity.
Streams were about introducing lots of layers, each doing some sort of processing on the packets (or whatever), and each layer involving copying the data. VJ's channels are about getting the data into user space as quickly as possible, minimizing the actual processing and layering.
Not quite true: STREAMS had built-in facilities to help avoid data copies. It is a heavily layered model, and the queueing is not extremely efficient, but byte-for-byte copies are minimal in a properly configured STREAMS system.
Van Jacobson's network channels -- Microkernel?
From a 10,000 m view, this looks either like the days of DOS where every application had its own TCP stack :), or a better take on microkernels. The kernel sets up the basic stuff for the machine, and the channels for the smaller daemons that then handle things like iptables, the stack etc. Each of these would be a herd of daemons interconnecting between each other.
Isn't it ironic that khttpd/tux sped up web server performance by moving protocol processing into the kernel, but now Van Jacobson can speed up web server performance by moving protocol processing out of the kernel :-)
Van Jacobson's network channels -- Microkernel?
It's not really so surprising. The cost being avoided in both cases is context switching. The default way of doing things is that the lower-level processing is in the kernel and the higher-level processing is in userspace. Either moving almost everything into the kernel, or moving almost everything out, reduces the overhead.
Van Jacobson's network channels -- Microkernel?
In this case, the advantage is also that protocol processing is being done 1) in parallel, on an SMP machine; and 2) in the same cache space as the interested user process.
Reading about moving the network stack out to userspace made me think of the design of MIT's Exokernel OS.
Exokernels
Exokernels
It would be unfortunate if TCP became incompatible with job control. I often suspend network jobs for various reasons, secure in the knowledge that the kernel will keep the TCP connection alive indefinitely.
On the other hand, most connection-oriented application protocols these days have a timeout mechanism above TCP. If you suspend your IMAP or XMPP client for more than a minute or two you're likely to have a broken connection when you foreground the application again.
Van Jacobson's network channels - how about network failovers?
The throughput gain numbers presented are certainly interesting. But does it affect e.g. network failovers (bonding) as well? There is loads of useful stuff [potentially] done in the kernel space, if one only sees the need for using it. Moving lots of the add-on functionality to the user-space libraries might put its own price tag on the complexity of those libraries as such.
Why wouldn't you be able to create a net channel to a non-physical netdevice? You still have removed the L4 processing from the kernel and gotten the same reduction in context switching.
Interrupt Latencies?
This way work is moved from kernel threads into userspace and into the interrupt handler. The amount of work needed to be done in interrupt context is probably also dependent on the number of channels, i.e. the number of open network sockets. Wouldn't we open the machine up to an effective DDoS attack: spam the network with small packets. If the machine has a lot of network sockets open, each interrupt takes a long time to execute and at some point there isn't any CPU left.
> This way work is moved from kernel threads into userspace and
> into the interrupt handler.
Packet handling isn't currently done in kernel threads, it's done in
tasklets (OK, tasklets do run in a softirqd thread with the RT patch).
That's a great way to hog a CPU, and the channel implementation fixes the
problem by passing work to a thread, just as you suggest! In the slides
you'll see that the example channel 'producer' code wakes the listening
thread, so the 'consumer' is necessarily a task (but not necessarily a
userspace one).
An O(log n) search algorithm in the isr would indeed have high latency
with a very large number of channels to choose from -- but there is no
reason why the isr would have to select the final target amongst millions
of user sockets; channels are just as good for intermediate queueing
within the kernel as they are for delivery to userspace.
Erm, actually VJ was removing code from the interrupt context - he gave the example of the e1000 driver going from ~700 lines of code executed in an interrupt context down to ~400 lines.