Kernel development [LWN.net]

Kernel release status

The current 2.6 prepatch remains 2.6.12-rc5. Linus's git repository contains 200 or so patches; these are mostly fixes, but there is also a conversion of the IDE driver code to the device model, a new Broadcom bcm5706 gigabit driver, the removal of the Philips webcam decompression code, an IPv4 "alias promotion" feature (make a secondary interface address into the primary if the previous primary is deleted), and an updated CPU frequency subsystem.

The current -mm tree is 2.6.12-rc5-mm2. Recent changes to -mm include the pluggable congestion avoidance modules patch, some filesystem namespace patches, some scheduler tweaks, and lots of fixes.

The current stable 2.6 kernel is 2.6.11.11, released on May 27.

The current 2.4 kernel is 2.4.31, released by Marcelo on May 31. 2.4.31 contains quite a few fixes and some driver updates, but new features are no longer being added to 2.4.

The ongoing Philips webcam driver saga

Linus has just merged a patch from Alan Cox removing some of the new decompression code from the Philips webcam driver. "The original pwc author raised some questions about the reverse engineering of the decompressor algorithms used in the pwc driver. Having done some detailed investigation it appears those concerns that clean room policy was not followed are reasonable." The hope, at this point, is to merge an improved version of the driver in 2.6.13 which will support (properly reverse-engineered) decompression modules in user space.

Time to remove LSM?

The first organized kernel summit, held in 2001, included a presentation on the NSA Security-Enhanced Linux project. Linus's response at the time was that there were several projects out there trying to find the best way to harden Linux, and that he did not want to have to choose between them. Instead, he asked for the creation of a generic framework which would allow an arbitrary security module to be plugged into the system. The result, some time later, was the Linux Security Module framework; LSM provides a long list of hooks into kernel operations which allow a security module to veto any action which violates the rules it is implementing.
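The hook-and-veto idea can be illustrated with a small user-space sketch. Everything named here (`sec_hooks`, `sec_register()`, `do_open()`, the demo policy) is hypothetical and chosen for illustration; it is not the kernel's actual LSM interface, which has far more hooks and different registration details.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical hook table: every name here is invented for illustration
 * and is NOT the kernel's actual LSM interface. */
struct sec_hooks {
    /* Return 0 to allow the operation, nonzero to veto it. */
    int (*file_open)(const char *path, int uid);
};

/* The single registered module; NULL means "no module loaded". */
static const struct sec_hooks *active_hooks = NULL;

/* Install a module; fails if one is already registered. */
int sec_register(const struct sec_hooks *hooks)
{
    if (active_hooks != NULL)
        return -1;
    active_hooks = hooks;
    return 0;
}

/* Remove the registered module. */
void sec_unregister(void)
{
    active_hooks = NULL;
}

/* Stand-in for a kernel operation: consult the hook before proceeding. */
int do_open(const char *path, int uid)
{
    if (active_hooks != NULL && active_hooks->file_open != NULL) {
        int err = active_hooks->file_open(path, uid);
        if (err != 0)
            return err;   /* the module vetoed the operation */
    }
    return 0;             /* allowed; the real work would happen here */
}

/* Example policy: only uid 0 may open anything under /secret. */
int demo_file_open(const char *path, int uid)
{
    if (strncmp(path, "/secret", 7) == 0 && uid != 0)
        return -1;
    return 0;
}

const struct sec_hooks demo_module = { .file_open = demo_file_open };
```

With no module registered, `do_open()` allows everything; after `sec_register(&demo_module)`, a non-root open of a path under /secret is refused while other opens proceed. The indirection through a table of function pointers is what lets the policy be swapped without touching the code that performs the operation.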

The LSM patch ran into some difficulties on its way into the kernel, but it is now an established part of the internal API. So some developers were surprised recently when James Morris suggested that perhaps the time has come to remove the LSM framework. His arguments are simple: there is only one serious module using the LSM framework in the intended manner, while unrelated projects are trying to use it in inappropriate ways.

"In the years since LSM was included in the mainline kernel, SELinux has been the only significant module implemented and also included in the mainline kernel. So we have a generalized framework for one user, SELinux, which itself is a generalized framework....

"It's dead code, an unnecessary abstraction layer between its one real user, SELinux, and the core kernel."

James asks: rather than forcing SELinux to conform to a general-purpose API (of which it is the sole user), why not just wire SELinux directly into the kernel, get rid of LSM, and be done with it?

SELinux is not truly the only security module out there, of course. The kernel includes a couple of other modules: a reimplementation of the capabilities mechanism and "root plug," a module which prevents processes from running as root unless a specific USB device is plugged in. There are out-of-tree modules, such as the BSD securelevels patch and Trustees Linux. The Immunix (now Novell) AppArmor product includes a module which uses the LSM framework. AppArmor is a proprietary offering, but the security module portion of it is GPL-licensed (as is necessary, since the functions for loading security modules are exported GPL-only).

There does not appear to be a groundswell of support for the idea of removing the LSM framework from the kernel at this time. That could change over time, however: increasingly, out-of-tree code is held to be irrelevant when decisions are made. If SELinux remains the only significant in-tree user of the LSM framework, LSM will look like useless baggage to more and more developers. If there are security modules out there which are reasonable alternatives to SELinux, their developers may want to think about getting them into the mainline sometime in the not-too-distant future.

Files with negative offsets

Every open file on a Linux system has an associated offset - the current read or write position within that file. The virtual filesystem code, when dealing with file positions, performs some basic checks, such as ensuring that the position is not negative. After all, what sense does it make to talk about a file position before the beginning of the file?

As it turns out, there is a situation where a negative file position makes sense. Special files (such as /dev/mem and /dev/kmem) provide a window into the system's main memory. The "position" within these files corresponds to the address of the memory of interest. The interesting thing is that, on the x86_64 platform, addresses can be negative numbers.

This comes about as follows: this architecture currently uses a 48-bit virtual address space. The hardware requires that the uppermost implemented bit (bit 47) be sign-extended through the remaining high bits of the 64-bit address, however, so any address with that bit set becomes a negative number when treated as a signed value. The x86_64 Linux port uses the upper bit to mark kernel space, so kernel addresses are, in fact, negative. A quick look at /proc/kallsyms confirms this:

    ffffffff80100000 T startup_32
    ffffffff80100100 T startup_64
    ffffffff801001a0 T initial_code
    ffffffff801001a8 T init_rsp
    ffffffff801001b0 T early_idt_handler
    ...

The end result is that using /dev/kmem on an x86_64 system is difficult; any attempt to seek into kernel space will yield an error.
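The arithmetic behind this can be checked with a short sketch. The helper names below (`sign_extend_48()`, `as_file_offset()`, `offset_is_negative()`) are invented for illustration; the sketch assumes a two's-complement platform, where converting a large unsigned 64-bit value to `int64_t` simply reinterprets the bits.

```c
#include <stdint.h>

/* Sign-extend a 48-bit virtual address to 64 bits by copying bit 47
 * into bits 48..63, as the text describes the hardware doing. */
uint64_t sign_extend_48(uint64_t addr)
{
    const uint64_t low48 = addr & 0x0000FFFFFFFFFFFFULL;
    if (low48 & (1ULL << 47))
        return low48 | 0xFFFF000000000000ULL;
    return low48;
}

/* Interpret an address as a signed 64-bit file offset. Converting an
 * out-of-range unsigned value to int64_t is implementation-defined in C;
 * on the usual two's-complement platforms it reinterprets the bits. */
int64_t as_file_offset(uint64_t addr)
{
    return (int64_t)addr;
}

/* Returns 1 if the address would look like a negative file position. */
int offset_is_negative(uint64_t addr)
{
    return as_file_offset(addr) < 0;
}
```

Applied to the address of startup_32 in the listing above, `as_file_offset(0xffffffff80100000ULL)` gives -0x7ff00000 (-2146435072), which is exactly the kind of value a "position must not be negative" check rejects.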

The clear fix is to modify the VFS layer to let negative file positions be passed through to the underlying filesystem or device driver. The problem with doing that in a general way, however, is that not all code (especially in drivers) is prepared to deal with a negative offset. Suddenly exposing that code to negative offsets could open up no end of bugs and security problems. So the real solution, as worked out by Al Viro and Linus Torvalds, is to add a new flag for the file structure called FMODE_ANY_OFFSET. This flag can only be set within the kernel; user space has no access to it. So the /dev/kmem driver will be able to set the flag and work with the full range of offsets, but, for the rest of the system, nothing will change.
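A minimal user-space sketch of the opt-in check might look like the following. Only the name FMODE_ANY_OFFSET comes from the discussion; the flag's value, the `file_sketch` structure, `SKETCH_EINVAL`, and `sketch_seek_set()` are assumptions made for illustration and do not reflect the actual patch.

```c
#include <stdint.h>

/* Hypothetical flag value and structure layout, invented for this sketch;
 * only the flag's name comes from the discussion above. */
#define FMODE_ANY_OFFSET 0x1000u

struct file_sketch {
    unsigned int f_mode;  /* mode flags, set only by "kernel" code */
    int64_t      f_pos;   /* current file position */
};

#define SKETCH_EINVAL 22  /* stand-in for the kernel's EINVAL */

/* Set an absolute position, rejecting negative offsets unless the
 * file's owner has opted in via FMODE_ANY_OFFSET. */
int sketch_seek_set(struct file_sketch *f, int64_t offset)
{
    if (offset < 0 && !(f->f_mode & FMODE_ANY_OFFSET))
        return -SKETCH_EINVAL;
    f->f_pos = offset;
    return 0;
}
```

A file whose `f_mode` lacks the flag behaves exactly as before, refusing a negative offset and leaving `f_pos` untouched, while a file marked with the flag accepts the same offset. Keeping the default restrictive is what protects drivers that were never written with negative offsets in mind.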

The beginning of the realtime preemption debate

Merging Ingo Molnar's realtime preemption work was never going to be a quiet process. The noise has, in fact, begun well before Ingo has even proposed his work for inclusion. Now might be a good time to catch up with the debate as a way of seeing how the arguments might go in the future.

The realtime preemption patches attempt to provide a guaranteed maximum response time for high-priority user-space processes - just like a "real" realtime operating system would. This goal is achieved by making everything in the kernel preemptible. No matter what the kernel is doing on a given processor, if a higher-priority process becomes runnable, it will be scheduled immediately. Many changes are required to make the whole kernel preemptible; the core parts are:

New locking primitives. The spinlocks used by the kernel can cause any number of processors to stall while waiting for a lock to become free. Code which holds a spinlock cannot be preempted, or a deadlocked kernel could result. The realtime preemption patches introduce a new mutual exclusion type (the rt_mutex) which does not spin, and, thus, will not stall a processor. The spinlocks and semaphores currently used in the kernel are all converted over to the new rt_mutex type, and all code which runs with spinlocks held becomes preemptible. The rt_mutex type also implements priority inheritance, so that a low-priority process will not block a higher-priority process (for long, at least) by losing the processor while holding an important lock.
Threaded interrupt handlers. Interrupt handlers can create latencies by monopolizing the processor for long periods of time. The realtime preemption patch moves interrupt handling into kernel threads, which contend for the processor with all other processes in the system. If a certain realtime task is more important than interrupt handling, its priority can be set accordingly.
Various other mutual exclusion mechanisms, including read-copy-update, per-CPU variables, and seqlocks, require that preemption be disabled. All of these mechanisms are changed for the realtime preemption mode, usually by making them look more like regular spinlocks.
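The priority inheritance behavior mentioned in the first item above can be modeled with a toy, single-threaded sketch. This is not the rt_mutex implementation: the `task` and `pi_lock` structures, the single-waiter limit, and the convention that a larger number means a higher priority are all assumptions made to keep the example short.

```c
#include <stddef.h>

/* Toy model, not the rt_mutex implementation.
 * Convention assumed here: larger number = higher priority. */
struct task {
    int base_prio;  /* priority assigned to the task */
    int eff_prio;   /* priority currently used for scheduling */
};

struct pi_lock {
    struct task *owner;   /* NULL when the lock is free */
    struct task *waiter;  /* at most one waiter in this sketch */
};

/* Try to take the lock. Returns 1 on success. On contention, returns 0,
 * records the waiter, and boosts the owner to the waiter's priority. */
int pi_lock_acquire(struct pi_lock *l, struct task *t)
{
    if (l->owner == NULL) {
        l->owner = t;
        return 1;
    }
    l->waiter = t;
    if (t->eff_prio > l->owner->eff_prio)
        l->owner->eff_prio = t->eff_prio;   /* priority inheritance */
    return 0;
}

/* Release the lock: drop any inherited boost and hand the lock to the
 * waiter, if there is one. */
void pi_lock_release(struct pi_lock *l)
{
    if (l->owner == NULL)
        return;
    l->owner->eff_prio = l->owner->base_prio;
    l->owner = l->waiter;
    l->waiter = NULL;
}
```

If a priority-1 task holds the lock and a priority-10 task then tries to take it, the holder runs at priority 10 until it releases the lock, at which point it drops back to 1 and the lock passes to the waiter. The boost is what keeps a medium-priority task from preempting the lock holder and indirectly stalling the high-priority one.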

The realtime preemption patch set (at version -RT-2.6.12-rc5-V0.7.47-10 as of this writing) is clearly large and intrusive - it would be hard to make fundamental changes like those listed above any other way. It should be noted that Ingo has gone out of his way to minimize this intrusiveness, however: the patch is written to minimize code changes, and the kernel functions as always if realtime preemption is not selected at configuration time. The merging of this patch set would not force the new preemption model on users.

According to Lee Revell, the realtime preemption patches are already seeing some serious use:

"All of the Linux audio oriented distributions are already shipping -RT kernels, and most of the serious Linux audio users who use general purpose distros are running it. That's a few thousand people running it 24/7 for months, and it's been at least a month since any of these users found a real bug in -RT."

Certainly the discussions that inevitably follow the release of a new version of the patch set indicate that there is an active user community out there. Some members of the community are starting to wonder why the realtime preemption patches have not been merged, and when (if ever) that might change. The biggest reason is that Ingo has not yet requested that the patches be included - though many small pieces and fixes from the realtime patch set have found their way into the mainline. If and when Ingo does push for inclusion, however, there will be some opposition.

To some developers, the realtime patch seems like a set of questionable and widespread changes aimed at the needs of a very small user community. Changing spinlocks into mutexes and moving interrupt handlers into threads are fundamental changes to how the kernel does things with the potential for the creation of subtle bugs and performance problems. Reworking things and adding complexity at that level is not a task that should be undertaken without a strong need - and many developers do not see a sufficiently strong need.

There are some concerns about the performance impact of these changes. Acquiring an uncontended spinlock is a very fast operation; the rt_mutex type, with its wait queues and priority inheritance mechanisms, is bound to be slower. Anecdotal evidence suggests that realtime preemption carries a performance hit, but little in the way of real benchmarking has been done. In any case, the performance penalty should only affect users who have actually enabled the realtime preemption mode.

Finally, not everybody is convinced that the realtime preemption approach can solve the real problem: providing an ironclad guarantee that a realtime process will be scheduled within a given maximum latency. Ingo believes that this guarantee can be made by eliminating all code within the kernel which can delay a reschedule; others feel that, to make a guarantee that can truly be trusted, the entire kernel must be audited and verified. They have a point: how strong a guarantee would you want before running realtime Linux in your car's braking system?

Those who want true realtime guarantees, along with developers who simply do not want to clutter the kernel with realtime mechanisms, argue that a different approach should be taken. The most commonly suggested alternative is RTAI-Fusion, which works (at its core) by interposing a "nanokernel" between Linux and the bare hardware. The nanokernel guarantees latency by taking the lowest-level scheduling decisions out of the Linux kernel's hands; it is kept small and easy to verify. Another project taking a similar approach is Iguana, which is based on the L4 microkernel.

Since the realtime preemption patch is not being proposed for merging at this time, no decisions are likely to result from the current, lengthy discussion. If Ingo has his way, there may never be one big decision; instead, pieces of the patch will be merged if and when it makes sense.

"So i'm afraid nothing radical will happen anywhere. Maybe we can have one final flamewar-party in the end when the .config options are about to be added, just for nostalgia, ok?"

There may be some interesting realtime-related sessions at next month's Kernel Summit in Ottawa, however. Meanwhile, the entire thread is available in the linux-kernel archives for anybody who wishes to plow through it.

Patches and updates

Kernel trees

Andrew Morton: 2.6.12-rc5-mm2
Domen Puncer: 2.6.12-rc5-kj
Chris Wright: Linux 2.6.11.11
Con Kolivas: 2.6.11-ck9
Marcelo Tosatti: linux-2.4.31 released
Marcelo Tosatti: Linux 2.4.31-rc2
Willy Tarreau: Linux-2.4.30-hf3

Architecture-specific

Benjamin Herrenschmidt: EXPERIMENTAL: global suspend cleanup
Benjamin LaHaise: x86-64: Use SSE for copy_page and clear_page
john stultz: new timeofday i386 arch specific changes (v. B0)
john stultz: new timeofday x86-64 arch specific changes (v. B0)
john stultz: new timeofday i386 and x86-64 timesources (v. B0)

Core kernel code

Ingo Molnar: Real-Time Preemption, -RT-2.6.12-rc5-V0.7.47-09
Ingo Molnar: Real-Time Preemption, -RT-2.6.12-rc5-V0.7.47-10
Ingo Molnar: TASK_NONINTERACTIVE (was: Machine Freezes while Running Crossover Office)
Nick Piggin: improve SMP reschedule and idle routines
Benjamin Herrenschmidt: Add some hooks to generic suspend code
Dipankar Sarma: scalable fd management
john stultz: new timeofday core subsystem (v. B0)
Daniel Walker: Abstracted Priority Inheritance for RT
Nigel Cunningham: Freezer Patches.

Development tools

Matt Mackall: Mercurial 0.5b vs git
Marco Costalba: qgit, another git GUI viewer
Paul Mackerras: gitk-1.1 out

Device drivers

Jens Axboe: SATA NCQ support
Jens Axboe: SATA NCQ #3
Alan Cox: remove non-cleanroom pwc driver compression
Jeff Garzik: netdev-2.6, wireless queues updated
Pavel Machek: switch pm_message_t to struct
dmitry pervushin: SPI core
Alexey Dobriyan: Introduce tty_unregister_ldisc()
Linas Vepstas: PCI Error Recovery Implementation
David S. Miller: Locking model for NAPI drivers
Matt Porter: RapidIO support: core
Matt Porter: RapidIO support: ppc32
Matt Porter: RapidIO support: net driver over messaging

Documentation

Jeff Garzik: libata dev guide updated

Filesystems and block I/O

Pavel Fedin: Full NLS support for HFS (classic) filesystem

Memory management

Christoph Lameter: NUMA aware slab allocator V4
Mel Gorman: Avoiding external fragmentation with a placement policy Version 12

Networking

John Heffner: Scalable TCP
Michael Tokarev: implement "blackhole" option for TCP and UDP

Security-related

serue@us.ibm.com: stacking get/setprocattr support patches
