Van Jacobson's network channels
Unfortunately, attending Van Jacobson's talk at linux.conf.au this year did not work out. Fortunately, David Miller was there and listening carefully. Van has figured out how the next round of networking performance improvements will happen, and he has the numbers to prove it. Expect some very interesting (and fundamental) changes in the Linux networking stack as Van's ideas are incorporated. This article attempts to cover the fundamentals of Van's scheme (called "channels"), based on David's weblog entry and Van's slides [PDF].
Van, like many others, points out that the biggest impediment to scalability on contemporary hardware is memory performance. Current processors can often execute multiple instructions per nanosecond, but loading a cache line from memory still takes 50ns or more. So cache behavior will often be the dominant factor in the performance of kernel code. That is why simply making code smaller often makes it faster. The kernel developers understand cache behavior well, and much work has gone into improving cache utilization in the kernel.
The Linux networking stack (like all others) does a number of things which reduce cache performance, however. These include:
- Passing network packets through multiple layers of the kernel. When a
packet arrives, the network card's interrupt handler begins the task
of feeding the packet to the kernel. The remainder of the work may
well be performed at software interrupt level within the driver (in a
tasklet, perhaps). The core network processing happens in another
software interrupt. Copying the data (an expensive operation in
itself) to the application happens in kernel context. Finally the
application itself does something interesting with the data. The
context changes are expensive, and if any of these changes causes the
work to move from one CPU to another, a big cache penalty results.
Much work has been done to improve CPU locality in the networking
subsystem, but much remains to be done.
- Locking is expensive. Taking a lock requires a cross-system atomic
operation and moves a cache line between processors. Locking costs
have led to the development of lock-free techniques like seqlocks and read-copy-update, but the networking stack (like the rest of the kernel) remains full of locks.
- The networking code makes extensive use of queues implemented with doubly-linked lists. These lists have poor cache behavior since they require each user to make changes (and thus move cache lines) in multiple places.
To demonstrate what can happen, Van ran some netperf tests on an instrumented kernel. On a single CPU system, processor utilization was 50%, of which 16% was in the socket code, 5% in the scheduler, and 1% in the application. On a two-processor system, utilization went to 77%, including 24% in the socket code and 12% in the scheduler. That is a worst case scenario in at least one way: the application and the interrupt handler were configured to run on different CPUs. Things will not always be that bad in the real world, but, as the number of processors increases, the chances of the interrupt handler running on the same processor as any given application decrease.
The key to better networking scalability, says Van, is to get rid of locking and shared data as much as possible, and to make sure that as much processing work as possible is done on the CPU where the application is running. It is, he says, simply the end-to-end principle in action yet again. This principle, which says that all of the intelligence in the network belongs at the ends of the connections, doesn't stop at the kernel. It should continue, pushing as much work as possible out of the core kernel and toward the actual applications.
The tool used to make this shift happen is the "net channel," intended to be a replacement for the socket buffers and queues used in the kernel now. Some details of how channels are implemented can be found in Van's slides, but all that really matters is the core concept: a channel is a carefully designed circular buffer. Properly done, circular buffers require no locks and share no writable cache lines between the producer and the consumer. So adding data to (or removing data from) a net channel will be a fast, cache-friendly operation.
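To make the idea concrete, here is a minimal, hypothetical sketch (plain C11, with invented names; not Van's actual data structure or the kernel's implementation) of such a single-producer, single-consumer ring. The producer writes only the tail index and the consumer writes only the head index, so no lock is needed and neither side's control state lives in a cache line the other side writes.

    /* Sketch of a lock-free SPSC "channel": producer owns 'tail',
     * consumer owns 'head'; the two indices are kept on separate
     * cache lines so neither side dirties the other's line. */
    #include <stdatomic.h>
    #include <stddef.h>

    #define CHAN_SLOTS 256                          /* power of two */

    struct net_channel {
        _Atomic size_t tail __attribute__((aligned(64)));  /* producer-owned */
        _Atomic size_t head __attribute__((aligned(64)));  /* consumer-owned */
        void *slot[CHAN_SLOTS];                             /* packet buffers */
    };

    /* Producer side (e.g. the driver's interrupt handler). */
    static int chan_put(struct net_channel *ch, void *pkt)
    {
        size_t t = atomic_load_explicit(&ch->tail, memory_order_relaxed);
        size_t h = atomic_load_explicit(&ch->head, memory_order_acquire);

        if (t - h == CHAN_SLOTS)
            return -1;                              /* full: drop the packet */
        ch->slot[t % CHAN_SLOTS] = pkt;
        atomic_store_explicit(&ch->tail, t + 1, memory_order_release);
        return 0;
    }

    /* Consumer side (e.g. the socket code or the application). */
    static void *chan_get(struct net_channel *ch)
    {
        size_t h = atomic_load_explicit(&ch->head, memory_order_relaxed);
        size_t t = atomic_load_explicit(&ch->tail, memory_order_acquire);

        if (h == t)
            return NULL;                            /* empty */
        void *pkt = ch->slot[h % CHAN_SLOTS];
        atomic_store_explicit(&ch->head, h + 1, memory_order_release);
        return pkt;
    }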
As a first step, channels can be pushed into the driver interface. A network driver need no longer be aware of sk_buff structures and such; instead, it simply drops incoming packets into a channel as they are received. Making this change cuts the CPU utilization in the two-processor case back to 58%. But things need not stop there. A next logical step would be to get rid of the networking stack processing at softirq level and to feed packets directly into the socket code via a channel. Doing that requires creating a separate channel for each socket and adding a simple packet classifier so that the driver knows which channel should get each packet. The socket code must also be rewritten to do the protocol processing (using the existing kernel code). That change drops the overall CPU utilization to 28%, with the portion spent at softirq level dropping to zero.
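The classifier itself can be small. The following is a hypothetical sketch (invented names and layout, not code from the actual patches) of the kind of lookup a driver might do: hash the packet's protocol, addresses, and ports into a flow table and hand the packet to the owning socket's channel, falling back to a default channel for anything unclaimed.

    /* Sketch of a per-socket packet classifier for received packets. */
    #include <stdint.h>
    #include <stddef.h>

    struct net_channel;                 /* the ring sketched above */

    struct flow_key {
        uint32_t saddr, daddr;          /* IPv4 source/destination address */
        uint16_t sport, dport;          /* TCP/UDP source/destination port */
        uint8_t  proto;                 /* IPPROTO_TCP, IPPROTO_UDP, ...   */
    };

    #define FLOW_BUCKETS 1024

    struct flow_entry {
        struct flow_key     key;
        struct net_channel *chan;       /* channel feeding the owning socket */
        struct flow_entry  *next;
    };

    static struct flow_entry  *flow_table[FLOW_BUCKETS];
    static struct net_channel *default_chan;   /* legacy stack / softirq path */

    static uint32_t flow_hash(const struct flow_key *k)
    {
        uint32_t h = k->saddr ^ k->daddr ^ k->proto;
        h ^= ((uint32_t)k->sport << 16) | k->dport;
        h *= 0x9e3779b1u;               /* cheap mixing step */
        return h % FLOW_BUCKETS;
    }

    /* Called from the driver for each received packet. */
    static struct net_channel *classify(const struct flow_key *k)
    {
        for (struct flow_entry *e = flow_table[flow_hash(k)]; e; e = e->next)
            if (e->key.saddr == k->saddr && e->key.daddr == k->daddr &&
                e->key.sport == k->sport && e->key.dport == k->dport &&
                e->key.proto == k->proto)
                return e->chan;
        return default_chan;            /* no socket channel: punt to the stack */
    }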
But why stop there? If one wants to be serious about this end-to-end thing, one could connect the channel directly to the application. Said application gets the packet buffers mapped directly into its address space and performs protocol processing by way of a user-space library. This would be a huge change in how Linux does networking, but Van's results speak for themselves. Here is his table showing the percentage CPU utilization for each of the cases described above:
    Configuration     Total CPU  Interrupt  SoftIRQ  Socket  Locks  Sched  App.
    1 CPU                    50          7       11      16      8      5     1
    2 CPUs                   77          9       13      24     14     12     1
    Driver channel           58          6       12      16      9      9     1
    Socket channel           28          6        0      16      1      3     1
    App. channel             14          6        0       0      0      2     5
The bottom line (literally) is this: processing time for the packet stream dropped to just over 25% of the previous single-CPU case, and to less than 20% of the previous two-CPU behavior. Three layers of kernel code have been shorted out altogether, with the remaining work performed in the driver interrupt handler and the application itself. The test system running with the full application channel code was able to handle twice the network bandwidth of an unmodified system - with the processors idle most of the time.
Linux networking hackers have always been highly attentive to performance
issues, so numbers like these are bound to get their attention. Beyond
performance, however, this approach promises simpler drivers and a
reasonably straightforward transition between the current stack and a
future stack built around channels. A channel-based user-space interface
will make it easy to create applications which can send and receive packets
using any protocol. If Van's results hold together in a "real-world"
implementation, the only remaining question would be: when will it be
merged so the rest of us can use it?
Index entries for this article: Kernel: Networking/Channels
Posted Jan 31, 2006 20:43 UTC (Tue) by imcdnzl (guest, #28899)
Posted Jan 31, 2006 21:23 UTC (Tue) by job (guest, #670)
Posted Jan 31, 2006 21:28 UTC (Tue) by cloose (guest, #5066)
Thank you for yet another enlightening article which explains foreign concepts so even mortals have at least a remote chance of understanding them! I totally agree. The just-renewed subscription paid off again. Thank you for this great article!
Posted Feb 1, 2006 15:19 UTC (Wed) by Baylink (guest, #755)
Do more. :-)
Posted Feb 9, 2006 9:05 UTC (Thu) by burki99 (subscriber, #17149)
Posted Jan 31, 2006 22:02 UTC (Tue) by ernest (guest, #2355)
Posted Jan 31, 2006 22:29 UTC (Tue) by csamuel (✭ supporter ✭, #2624)
His comment was that there was no reason why netfilter couldn't become just another consumer of packets. My take on that is that, whilst it would require patching the netfilter code, that too could be a good thing if it eliminates the use of doubly-linked lists (the use of which would cause you to fail CS 101 under VJ, according to him :-)), but it would probably be a lot of code.
Of course, this is only necessary if you go further than channelising the drivers themselves. One of the really elegant things about this change of thinking is that it's very modular - you can convert drivers one at a time until they're all done, then look at channelising the socket layer, and then start on the consumers of the socket layer.
There are two nice things about having the TCP stack running in user space: one is that it allows you to easily experiment with and debug TCP issues and have custom behaviour for different applications based on need; the second is that VJ explained the only reason it had to go into the kernel in Multics in the first place was that if a user process got paged out there it could take two minutes to get paged back in, which TCP/IP doesn't like. :-)
Chris
Posted Feb 1, 2006 11:00 UTC (Wed) by james (subscriber, #1325)
That looks like it's enough for most firewalling: it should give you pass (existing channel), fail (no channel), or needs more work (channel to netfilter).
Posted Feb 1, 2006 20:00 UTC (Wed) by NAR (subscriber, #1313)
Posted Feb 2, 2006 5:45 UTC (Thu) by xoddam (guest, #2322)
Posted Feb 2, 2006 9:57 UTC (Thu) by NAR (subscriber, #1313)
Posted Feb 2, 2006 23:56 UTC (Thu) by xoddam (guest, #2322)
Posted Feb 2, 2006 21:46 UTC (Thu) by iabervon (subscriber, #722)
Of course, the kernel would have to keep a TCP implementation, but that's not surprising, since static binaries that use sockets should continue to work.
Posted Feb 2, 2006 12:29 UTC (Thu) by samj (guest, #7135)
Posted Feb 2, 2006 20:12 UTC (Thu) by jonabbey (guest, #2736)
Posted Feb 2, 2006 20:53 UTC (Thu) by caitlinbestler (guest, #32532)
Posted Jan 31, 2006 22:36 UTC (Tue) by xav (guest, #18536)
Posted Jan 31, 2006 22:49 UTC (Tue) by csamuel (✭ supporter ✭, #2624)
0) VJ said that the talk was *not* about fixing the Linux TCP stack, as "the Linux TCP stack isn't broken" - but just because something has always been done this way (SKBs) doesn't mean it is necessarily the best way.
1) VJ was asked about this code going into the kernel - his reply was that he would slap the GPL onto his new code for the drivers and the socket layer on the plane home, but that the user-layer TCP side may be a bit more difficult as he needs to get agreement from others.
2) The entire user-level TCP stack was done as an LD_PRELOAD'ed library, and hence no actual changes to applications are necessary, so people can experiment to their hearts' content with tuning TCP application by application. Fancy an Apache with a different congestion control method to your OpenSSH clients and server? Go for it. (A small illustration of the LD_PRELOAD mechanism follows this list.)
3) This reduced the amount of code in the interrupt handler of the e1000 considerably (from ~700 lines down to ~300) and removes all SKB code, hence simplifying drivers, which can only be a good thing.
4) The new channelised napi_poll() routine is generic, rather than the current device-dependent implementations.
All in all, an excellent talk.
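For readers unfamiliar with the mechanism mentioned in point 2, here is a tiny, hypothetical illustration of how an LD_PRELOAD library interposes on socket calls. It is not Van's library, just the standard interposition technique it relies on; a real channel library would run its TCP engine where this stub merely logs and passes through.

    /* shim.c - build: gcc -shared -fPIC -o libtcpshim.so shim.c -ldl
     * run:   LD_PRELOAD=./libtcpshim.so ./some_app                   */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    static ssize_t (*real_send)(int, const void *, size_t, int);

    ssize_t send(int fd, const void *buf, size_t len, int flags)
    {
        if (!real_send)         /* look up libc's send() the first time */
            real_send = (ssize_t (*)(int, const void *, size_t, int))
                            dlsym(RTLD_NEXT, "send");

        /* A channel-based library would check whether fd is backed by a
         * user-space channel and, if so, do its protocol processing here
         * instead of entering the kernel.  This stub only logs. */
        fprintf(stderr, "send(fd=%d, %zu bytes)\n", fd, len);
        return real_send(fd, buf, len, flags);
    }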
Posted Feb 1, 2006 0:57 UTC (Wed) by nix (subscriber, #2304)
Posted Feb 7, 2006 18:15 UTC (Tue) by arafel (guest, #18557)
Posted Jan 31, 2006 22:54 UTC (Tue) by csamuel (✭ supporter ✭, #2624)
1) His test results for 10Gb/s ethernet were limited not by drivers, kernel or networking but by the memory bandwidth of the DDR333 chips in the system he was testing on!
He estimated that you would need at least DDR800 RAM to be able to have enough memory bandwidth to drive 10Gb/s at capacity.
2) VJ said anyone who tells you that you'll need a TOE is not telling the truth. OK - this was known already but could be handy for beating vendors around the head with.. :-)
Posted Feb 1, 2006 2:17 UTC (Wed) by bk (guest, #25617)
Posted Feb 1, 2006 2:25 UTC (Wed) by corbet (editor, #1)
Posted Feb 2, 2006 5:35 UTC (Thu) by bos (guest, #6154)
Not having been at the talk, I don't know what circumstances he was talking about (perhaps specifically TCP at 10Gbps?), so I'm not picking a nit with his assertion, just pointing out that something along those lines can be done without scads of memory bandwidth.
Posted Feb 8, 2006 8:57 UTC (Wed) by csamuel (✭ supporter ✭, #2624)
Posted Jan 31, 2006 23:27 UTC (Tue) by dwheeler (guest, #1216)
But it appears to me that this has a dark side. Today, because the kernel assembles packets, only trusted (root) programs can forge packets and create many kinds of funky attack packets. If user-level applications can create arbitrary packets, then ANYONE -- even untrusted applications -- can forge arbitrary packets and arbitrary attack packets.
Clearly, in some situations this wouldn't matter. But historically getting only "user privileges" limited what you could do, including having to give away your IP address and only being able to send certain kinds of packets. This gives a new weapon, not so much against the machine IMPLEMENTING the new approach, but against OTHER machines (whether or not they do so). Today, given only low privileges, you can't create funky packets (like Xmas tree ones) or total forgeries. Unless there are kernel-level checks or I misunderstand something, you CAN cause these problems.
Just imagine; a user-space app sends out a broadcast from 127.0.0.1, etc., etc. There's a LOT of mischief that's been limited to kernel-space programs before that this might expose.
I'd like to see an implementation automatically check the outgoing packets for certain properties as part of the kernel (e.g., valid sender IP and port address, etc.). But I fear that won't happen by default, because (1) that would take extra time, and (2) it only affects OTHER people. And I understand (though don't agree with) the other side of the coin: Yes, of course people who have root can send any packet. That's not my point. My point is that for a large, shared network to be useful, there needs to be a defense-in-depth so that attackers aren't automatically given the whole store when they get just a little privilege. This would kick away one of those mechanisms.
Eek.
Posted Feb 1, 2006 2:20 UTC (Wed) by elanthis (guest, #6227)
Absolutely nothing stops a user from booting their workstation with a LiveCD that they have root access to. Or plugging in a different machine to the network. Or rebooting into single-user mode.
You cannot rely on a per-machine control like root access to protect your network. If you want to do that, you have to have some sort of encryption/signing on every network packet sent and physically lock down the end-user workstations so that they can't reboot into single-user mode or pop in a LiveCD or modify/replace the hard disk.
Posted Feb 1, 2006 2:34 UTC (Wed) by Ross (guest, #4065)
If it could, that's what I'd call a gaping security hole.
Posted Feb 1, 2006 3:57 UTC (Wed) by xoddam (guest, #2322)
Posted Feb 1, 2006 6:38 UTC (Wed) by cventers (guest, #31465)
Posted Feb 3, 2006 4:15 UTC (Fri) by zblaxell (subscriber, #26385)
Posted Feb 3, 2006 5:48 UTC (Fri) by xoddam (guest, #2322)
Posted Feb 1, 2006 12:15 UTC (Wed) by smitty_one_each (subscriber, #28989)
Posted Feb 1, 2006 13:54 UTC (Wed) by Ross (guest, #4065)
Posted Feb 2, 2006 4:05 UTC (Thu) by elanthis (guest, #6227)
If you are implicitly trusting every packet sent by some 'trusted' host (which, if it were truly trusted, would never be running any malicious code anyhow), or trusting anything running on port 1024 down, you're not running a very secure network at all.
There is no security at the IP level at all. If you want trust and security, you have to put it all in higher layers.
Posted Feb 2, 2006 5:28 UTC (Thu) by Ross (guest, #4065)
Posted Feb 1, 2006 16:35 UTC (Wed) by dwheeler (guest, #1216)
That's not the point. Not all users are malicious. Sometimes they run programs that APPEAR to do one thing, but do another. Sometimes systems run servers (like web servers) that an attacker can somehow subvert. In THOSE cases, I'd like the system to still limit what the attacker can do, INCLUDING limits on how the attacker can attack other systems.
Sure, all systems should be invulnerable to all attackers. But they aren't. Anyone who's managed a big network knows how hard it is to keep EVERYTHING secure, ESPECIALLY since there are some vendors who do not release patches for KNOWN vulnerabilities (names withheld, but Google can help you find them rather quickly). So you really need defense-in-depth: you need to try to make it so that attackers have to break down MULTIPLE barriers to get the goods.
Limiting the network-level actions of unprivileged accounts is not the be-all of security. But it's one of the few mechanisms we CURRENTLY have deployed widely that slow the spread of attacks across a network. Diseases that spread rapidly are often unstoppable, because you just don't have enough time to react. Slowing the spread of a disease is key to countering it. Similarly, in the network world, slowing down attack vectors is also key to countering it.
I'd like to see that packets from untrusted user apps are still FORCED to obey certain limits on what they can send.
You don't need a system-wide lock to do that kind of checking; after a call to the kernel, the memory could be mapped out and checked WITHOUT harming the cache lines of other systems.
For most systems it'd just involve checking a few bytes... nothing expensive, and certainly taking less time than sending something down any network port.
Posted Feb 1, 2006 20:08 UTC (Wed) by NAR (subscriber, #1313)
Except the fact that this user can be an unauthorized one who's just cracked into the system from another continent using the latest bug in a PHP BBS, and his processes are running as the 'nobody' user. He'd have a hard time putting a live CD into the computer, but we still really don't want him to send arbitrary packets into the network.
Posted Feb 8, 2006 13:14 UTC (Wed) by jzbiciak (guest, #5246)
That way, the TCP/IP implementation can be stored away in a fixed implementation that root checks in on (and the kernel may even checksum at launch time), but the processing still lives in userspace. It looks a little like the priv-sep that sshd uses.
Granted, with two cooperating threads, you get back to some of the context switching issues, but still it feels a little more flexible than keeping it in kernel space.
Posted Feb 11, 2006 9:33 UTC (Sat) by efexis (guest, #26355)
You obviously don't see this practice taking off, but even so, I think if you told someone who offers shared hosting not to bother protecting against ways unprivileged users can cause havoc, because somebody "can simply break into the datacenter with a boot CD or laptop"... you'd probably get an incredibly sarcastic response.
But no, you stick to talking about how to improve the performance of a heavily loaded 10Gig network interface --on a workstation-- where it'll really count.
Posted Feb 2, 2006 4:11 UTC (Thu) by jamesh (guest, #1159)
As for sending, all that is necessary is a packet verifier that makes sure the packet is appropriate for the given socket (which should be a lot simpler than a full TCP send implementation). If the packet shouldn't be getting sent from the socket, the kernel doesn't transmit it.
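As a hypothetical sketch of what such a verifier might check (names and structures invented for illustration, not code from the talk), the kernel only has to compare the headers of the outgoing buffer against the flow the socket actually owns:

    /* Sketch of an outbound packet check for a user-space channel. */
    #include <stdbool.h>
    #include <stdint.h>

    struct sock_flow {                  /* flow the socket is bound to */
        uint32_t local_addr, remote_addr;
        uint16_t local_port, remote_port;
        uint8_t  proto;
    };

    struct pkt_hdrs {                   /* parsed from the outgoing buffer */
        uint32_t saddr, daddr;
        uint16_t sport, dport;
        uint8_t  proto;
    };

    static bool tx_packet_allowed(const struct sock_flow *owner,
                                  const struct pkt_hdrs *p)
    {
        return p->proto == owner->proto        &&
               p->saddr == owner->local_addr   &&  /* no spoofed source     */
               p->daddr == owner->remote_addr  &&
               p->sport == owner->local_port   &&
               p->dport == owner->remote_port;     /* stays on this flow    */
    }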
Posted Feb 3, 2006 18:57 UTC (Fri) by caitlinbestler (guest, #32532)
So the privileged end of the channel can validate
Posted Feb 1, 2006 0:00 UTC (Wed) by busterb (subscriber, #560)
Posted Feb 1, 2006 9:51 UTC (Wed) by etroup (guest, #21786)
Posted Feb 1, 2006 10:19 UTC (Wed) by ctg (guest, #3459)
Posted Feb 1, 2006 15:51 UTC (Wed) by macc (guest, #510)
could you elaborate on the differences?
Posted Feb 1, 2006 16:39 UTC (Wed) by vmole (guest, #111)
Posted Feb 2, 2006 4:53 UTC (Thu) by bcd (guest, #11759)
These channels are a totally different concept, though, and address a different problem entirely.
Posted Feb 1, 2006 15:41 UTC (Wed) by smoogen (subscriber, #97)
[I am sorry for the hurd joke.. it was a very long night and I found it funny.]
Posted Feb 1, 2006 20:12 UTC (Wed) by NAR (subscriber, #1313)
Posted Feb 1, 2006 20:48 UTC (Wed) by JoeBuck (subscriber, #2330)
Posted Feb 2, 2006 2:13 UTC (Thu) by jwb (guest, #15467)
Posted Feb 2, 2006 4:11 UTC (Thu) by ttfkam (guest, #29791)
Just as Ingo Molnar's work in migrating the scheduler to an O(1) algorithm was a taste of more extensive implementation of O(1) algorithms in the kernel, perhaps elements of the exokernel will find their day in various parts of Linux.
Posted Feb 2, 2006 11:55 UTC (Thu) by csamuel (✭ supporter ✭, #2624)
VJ said that the only reason that TCP/IP was done in the kernel in Multics in the first place was because that was the only place you could be guaranteed not to get paged out for two minutes at a stretch.
Posted Feb 4, 2006 19:44 UTC (Sat) by kbob (guest, #1770)
Posted Feb 5, 2006 1:47 UTC (Sun) by lutchann (subscriber, #8872)
Posted Feb 2, 2006 7:11 UTC (Thu) by jiri.hlusi (guest, #34016)
-- Jiri
Posted Feb 3, 2006 19:00 UTC (Fri) by caitlinbestler (guest, #32532)
Posted Feb 2, 2006 13:04 UTC (Thu) by simlo (guest, #10866)
This issue is not so much of a problem if you run the handling in a thread (ksoftirqd) which will get lower priority as it starts to eat a lot of CPU. That way packets are dropped, but the rest of the system can run.
Posted Feb 3, 2006 0:34 UTC (Fri) by xoddam (guest, #2322)
Posted Feb 8, 2006 9:02 UTC (Wed) by csamuel (✭ supporter ✭, #2624)
Small correction
A small correction to your story (as I was fortunate enough to be there) - the 50% is 100% of 1 CPU on a 2 CPU system - he was showing the results of the network card driver and the stack being bound to the same CPU.
Thank you for yet another enlightening article which explains foreign concepts so even mortals have at least a remote chance of understanding them! The concept sounds very interesting and I really look forward to seeing it in code.
I concur on both points
And, just to hammer the point home, Jon: you're at your best (and most compensable) when you're doing original journalism. We pay you for your opinions, and your clarifications (like this one) of complicated topics.
After reading this article I immediately thought: Wow, maybe I should finally subscribe after years of reading LWN a week delayed. This is the kind of article that differentiates this publication from the hundreds of other publications that just grab a quote from LKML (Linus says no to GPL3) and add no insight at all.
Interesting, but where do things like iptables and other network security fall in if the kernel doesn't do anything anymore with network packets? Up to now that part fell into the socket interface (I think). I can't believe that the last remaining 0% CPU still contains iptables handling.
Can IP security be delegated to userspace?
ernest.
Van Jacobson's network channels and Netfilter
I was fortunate enough to be both at the original presentation and when he repeated it for the "best of" stream at the end of LCA2006 and got to ask him about what the situation was with netfilter.
Presumably you can do a lot of security when you set up the channels. It looks like the packet classifier "reads the protocol, ports, and addresses to determine the flow ID and uses this to find a channel" (Dave Miller's blog).
I'm not sure I fully understand this, but it seems that these channels are used when there is a socket to the user space, i.e. an application running on the host sends/receives data to/from the network. But what about the case when there's no application? As far as I know, in routers the IP packets usually don't get to user space, but if protocol processing is moved to user space (netfilter), it might degrade performance, mightn't it?
The phased implementation described only moves packet processing to userspace at the very last stage. At the 'ends' of the network this is appropriate for efficiency. But even before that stage, channels are a better way to pass packets around within the kernel. The task-oriented interface (using wakeups instead of soft interrupts) would probably mean netfilter no longer runs in tasklet context. We might instead see several netfilter kernelspace daemons (like kswapd and friends), one for each CPU.
Wouldn't it lead to code duplication? For example, a box doing NAT would need a (limited?) TCP/IP implementation in kernel space, while the host running e.g. an FTP client would need the full TCP/IP implementation in user space.
> Wouldn't it lead to code duplication?
Yes. So does inlining :-)
I don't see any reason that all of the channels would have to go to userspace. If a packet is to a kernel NFS client, it would end up in the kernel code, but without all the copies between the network and the VFS.
If it doesn't cost anything, why not? You'd just plug netfilter in before the app and map packet buffers into its address space first. This can all be done in a separate security context too. Looks to me like it would mean a lot less in the way of protocol-specific handling and would allow you to chain such tasks easily (eg netfilter->ipsec or netfilter->reverse proxy->web server etc.).
Please, someone tell me they're not re-inventing STREAMS.
Connections can be channelized *after* they have passed netfilter inspection.
nitpicking
s/want to 77%/went to 77%/
A couple of other quick comments from my notes of the two instances of this talk done at LCA 2006:
Obviously in a full-userspace-TCP/IP implementation, the default implementation would go into something on a par with glibc. Giving everything that wants to do networking an LD_PRELOAD would hammer performance and give Ulrich Drepper an aneurysm ;)
Well, y'know, I'm sure Ulrich's not got enough to do these days... ')
Van Jacobson's network channels & 10 Gb/s ethernet
Two final comments - honest!
What is a TOE?
TOE = "TCP Offload Engine," the TCP protocol implemented in adapter firmware. See this Kernel Page article from last August for one Linux-based implementation and all the reasons why it didn't get merged.
TOE
Plenty of Linux networking gear can drive 10Gbps hardware at line rate, and I'm not even talking about fancy TOE hardware in all cases.
Yes, this is about TCP, not just pushing datagrams out.
This looks VERY interesting, and I expect that this WILL be implemented. I think this is (mostly) a very good idea.
VERY interesting - but security implications to others?!?
That whole "only root can do stuff to the network" reasoning is complete bunk.VERY interesting - but security implications to others?!?
That's not the only meaning of that statement
I sure hope that no application which I run as a normal user is able to reboot the system into another operating system in order to use raw sockets and low port numbers.
I can imagine an internal 'firewall' inspecting the header of each packet traversing a channel from userspace to ensure the app has sufficient privilege to send it. A pipeline stage with negligible performance impact -- it wouldn't thrash the cache, and if it's in the kernel it would involve no extra context switches.
protocol validity checks
Yeah, since you're writing into mapped memory, the kernel can check it out in place. And since there's no copy, it's going to be hanging out in the cache when the check has to take place.
kernel can check it out in place... while the user, maybe on another CPU, switches a few bits just after the kernel check but before the network card picks up the data.
Sneaky indeed!
Ok, freely mapped memory doesn't cut it then. I wonder what the performance impact of changing packet buffers' page permissions would be, relative to copying (and relative to keeping the TCP implementation in kernel space)?
I think parent's point was about a living user doing a cold boot with a live CD, and then committing mischief, not about an existing application you're running somehow warm booting and doing that.
Clearly, unprotected ON/OFF switches and promiscuous BIOS boot settings can be a gaping security hole.
Yes, and my point is that what is equivalent to physical security is not the totality of the problem. Of course, if someone has physical access to your network they can put any packets they like on it. But unprivileged processes running on your server don't have physical access, yet in this scenario they would have the same level of access.
And my point remains... what is that unprivileged process going to do that you couldn't do by plugging in a laptop or some other device onto the network?
If your only point is that security shouldn't depend on the network not being compromised, I agree. However, malicious users with unfettered physical access are not at all equivalent to malicious processes running under unprivileged IDs, and anything which makes them equivalent is decreasing security. Does it matter for well-designed programs? No. But unfortunately tons of commonly used software is not well designed. If you can't trust IPs, port numbers, etc., many things break down. If you can't trust a program a user downloaded you should worry, but your network is not automatically compromised unless there is something which can be exploited on the system.
Sure, if a user is malicious, they can boot the system into some OS where they're fully privileged and attack. Or just unplug the network, plug in their own laptop where they have all privileges, and attack.
Users don't always WANT to attack
Absolutely nothing stops a user from booting their workstation with a LiveCD that they have root access to.
VERY interesting - but security implications to others?!?
I wonder if you can still get most of the benefits of network channels if you limit their accessibility to special user IDs, and then require non-privileged applications to use cooperating threads--one privileged, one not--to send packets.
If my memory serves me correctly, there are one or two servers out there on the internet that run Linux. I know it's not very common, but some of them actually rent space/bandwidth/etc so that people can host their own websites, and they allow these users to run code (cgi scripts) or even have shell access via telnet/ssh.
There isn't much problem with this model on the receive end, assuming that the kernel correctly classifies packets and sends them to the right process.
The same filter rules that route inbound packets can be used to validate outbound packets. You simply do not accept packets from a channel if the response packet would not be routed to the matching channel.
that every packet on it is for a TCP connection that is actually assigned to that channel.
Arsenic was an earlier implementation of a similar idea with apparently good results:
http://www.cl.cam.ac.uk/Research/SRG/netos/arsenic/
So is anyone else reminded of System V Streams?
reinvention?
No. Streams are almost antithetical to Channels.
I had the same feeling of similarity.
Streams were about introducing lots of layers, each doing some sort of processing on the packets (or whatever), and each layer involving copying the data. VJ's channels are about getting the data into user space as quickly as possible, minimizing the actual processing and layering.
Not quite true: STREAMS had built-in facilities to help avoid data copies. It is a heavily layered model, and the queueing is not extremely efficient, but byte-for-byte copies are minimal in a properly configured STREAMS system.
Van Jacobson's network channels -- Microkernel?
From a 10,000 m view, this looks either like the days of DOS where every application had its own TCP stack :), or a better take on microkernels. The kernel sets up the basic stuff for the machine, and the channels for the smaller daemons that then handle things like iptables, the stack etc. Each of these would be a herd of daemons interconnecting between each other.
Isn't it ironic that khttpd/tux sped up web server performance by moving protocol processing into the kernel, but now Van Jacobson can speed up web server performance by moving protocol processing out of the kernel :-)
Van Jacobson's network channels -- Microkernel?
It's not really so surprising. The cost being avoided in both cases is context switching. The default way of doing things is that the lower-level processing is in the kernel and the higher-level processing is in userspace. Either moving almost everything into the kernel, or moving almost everything out, reduces the overhead.
Van Jacobson's network channels -- Microkernel?
In this case, the advantage is also that protocol processing is being done 1) in parallel, on an SMP machine; and 2) in the same cache space as the interested user process.
Reading about moving the network stack out to userspace made me think of the design of MIT's Exokernel OS.
Exokernels
Exokernels
It would be unfortunate if TCP became incompatible with job control. I often suspend network jobs for various reasons, secure in the knowledge that the kernel will keep the TCP connection alive indefinitely.
On the other hand, most connection-oriented application protocols these days have a timeout mechanism above TCP. If you suspend your IMAP or XMPP client for more than a minute or two you're likely to have a broken connection when you foreground the application again.
Van Jacobson's network channels - how about network failovers?
The throughput gain numbers presented are certainly interesting. But does it affect e.g. network failovers (bonding) as well? There is loads of useful stuff [potentially] done in the kernel space, if one only sees the need for using it. Moving lots of the add-on functionality to the user-space libraries might put its own price tag on the complexity of those libraries as such.
Why wouldn't you be able to create a net channel to a non-physical netdevice? You still have removed the L4 processing from the kernel and gotten the same reduction in context switching.
Interrupt Latencies?
This way work is moved from kernel threads into userspace and into the interrupt handler. The amount of work needed to be done in interrupt context is probably also dependent on the number of channels, i.e. the number of open network sockets. Wouldn't we open the machine up to an effective DDoS attack: spam the network with small packets. If the machine has a lot of network sockets open, each interrupt takes a long time to execute and at some point there isn't any CPU left.
> This way work is moved from kernel threads into userspace and
> into the interrupt handler.
Packet handling isn't currently done in kernel threads, it's done in
tasklets (OK, tasklets do run in a softirqd thread with the RT patch).
That's a great way to hog a CPU, and the channel implementation fixes the
problem by passing work to a thread, just as you suggest! In the slides
you'll see that the example channel 'producer' code wakes the listening
thread, so the 'consumer' is necessarily a task (but not necessarily a
userspace one).
An O(log n) search algorithm in the isr would indeed have high latency
with a very large number of channels to choose from -- but there is no
reason why the isr would have to select the final target amongst millions
of user sockets; channels are just as good for intermediate queueing
within the kernel as they are for delivery to userspace.
Erm, actually VJ was removing code from the interrupt context - he gave the example of the e1000 driver going from ~700 lines of code executed in an interrupt context down to ~400 lines.