Kernel development [LWN.net]

Kernel release status

The current 2.6 prepatch remains 2.6.24-rc3. Fixes continue to flow into the mainline git repository at a relatively high rate; 2.6.24-rc4 must be due sometime in the very near future.

The current -mm tree is 2.6.24-rc3-mm2. Recent changes to -mm include the new timerfd API (see below), a number of driver core changes, a per-process capability bounding set feature, and an updated version of the SMACK security module.

The current stable 2.6 kernel is 2.6.23.9, released on November 26. There are a couple dozen or so important fixes in this update.

For older kernels: 2.6.22.14 was released on November 21.

Quote of the week

The Linux kernel requires that any needed documentation accompany all changes requiring said documentation -- part of the source-code patch must apply to the Documentation/ directory.

-- Donnie Berkholz engages in some wishful thinking

Tightening symbol exports

By Jonathan Corbet
November 27, 2007

The kernel's loadable module mechanism does not give modules access to all parts of the kernel. Instead, any kernel symbol which is intended to be usable by loadable modules must be explicitly exported to them via one of the variants of the EXPORT_SYMBOL() macro. The idea behind this restriction is to place limits on the reach of modules and to provide a relatively well-defined module API. In practice, there have been few limits placed on the exporting of symbols, with the result that many thousands of symbols are available to modules. Loadable modules can access many of the obviously useful symbols (printk(), say, or kmalloc()), but they can also get at generic symbols like edd, tpm_pm_suspend(), vr41xx_set_irq_trigger(), or flexcop_dump_reg().

There are reasons for the concern over excessive symbol exports felt by some developers. Wrongly exported symbols can lead module authors to use incorrect interfaces; for example, the exporting of sys_open() is an active inducement for developers to open files directly inside the kernel, which is almost never a good idea. But such symbols, once exported, can prove hard to unexport. While the official line says that the internal kernel API can change at any time, the truth of the matter is that at least some developers are reluctant to break external modules when that can be avoided.

A more timely example would be init_level4_pgt, a low-level symbol exported only by the x86_64 architecture. The current -mm tree removes that export, breaking the proprietary NVIDIA module in the process. Andrew Morton describes this removal as "our clever way of reducing the tester base so we don't get so many bug reports." While many developers make a show of not caring about binary-only modules, there is still a good chance that this particular export removal (of a symbol which should not really be available globally) may not make it into the mainline as a result of this breakage.

The end result of all this is that there has long been interest in cleaning up the modular API, though few people have put much time toward that end. Occasionally somebody has remarked upon one piece of low-hanging fruit: symbols which are exported only to make it possible to modularize other bits of mainline kernel code. One example is a whole set of TCP stack symbols (things like __tcp_put_md5sig_pool()) which have exactly one user: the IPv6 module. Restricting these special-purpose exports has the potential to significantly narrow the modular API without making it harder to modularize the mainline.

Andi Kleen's module symbol namespace patch is meant to enable just this sort of narrowing of the API. With this patch, symbols can be exported into specific "namespaces" which are only available to modules appearing on an associated whitelist. In a sense, the term "namespace" is a poor fit here; there is still a single, global namespace within which all exported symbols must be unique. These "namespaces" are more like special exclusion zones containing symbols which are not globally accessible. They work like GPL-only exports, which also restrict the availability of symbols to a subset of modules.

To create a restricted export, an ordinary EXPORT_SYMBOL() declaration is changed to:

    EXPORT_SYMBOL_NS(namespace, symbol);

Where namespace is the name of a restricted symbol namespace. So, going back to the TCP example, Andi's patch contains a number of changes like:

    -EXPORT_SYMBOL(__tcp_put_md5sig_pool);
    +EXPORT_SYMBOL_NS(tcp, __tcp_put_md5sig_pool);

Note that there is no _GPL version; any symbol which is exported into a specific namespace is treated as GPL-only by default.

The other part of the equation is to enable access to a namespace. That is done with:

    MODULE_NAMESPACE_ALLOW(namespace, module);

Such a declaration (which must appear in a module exporting symbols into the namespace) says that the given module can access symbols in that namespace. Andi's patch creates three namespaces (tcp, tcpcong for congestion control modules, and udp), removing about 30 symbols from the global namespace.

A number of developers welcomed this patch, seeing it as a step forward in the rationalization of the loadable module API. It is seen as a way to prevent out-of-tree modules from using symbols which they should not be using. It also reduces the number of interfaces which must be kept stable in situations (enterprise kernels, for example) where changes are not allowed. And, finally, the symbol namespaces offer the ability to organize exports somewhat and document who the intended users are.

There is a bit of dissent, though. In particular, Rusty Russell fears that the patch adds unneeded complexity and threatens to make life harder for out-of-tree developers for little (if any) gain. Says Rusty:

For example, you put all the udp functions in the "udp" namespace. But what have we gained? What has become easier to maintain? All those function start with "udp_": are people having trouble telling what they're for?

If you really want to reduce "public interfaces" then it's much simpler to mark explicitly what out-of-tree modules can use.

Herbert Xu has similar concerns:

These symbols are exported because they're needed by protocols. If they weren't available to everyone then it would be difficult to start writing new protocols....

So based on the network code at least I'm kind of starting to agree with Rusty now: if a symbol is needed by more than one in-tree module chances are we want it to be exported for all.

While these voices seem to be in the minority, they still carry quite a bit of weight. So your editor is unwilling to make any sort of guess as to whether this patch will be merged, or in what form. The desire to clean up the modular API is unlikely to go away, though, so, sooner or later, something is likely to happen.

kmemcheck

By Jonathan Corbet
November 27, 2007

Using uninitialized memory can lead to some seriously annoying bugs. If you are lucky, the kernel will crash with the telltale slab poisoning pattern (0x5a5a5a5a or similar) in the traceback. Other times, though, something more subtly wrong happens, forcing a long hunt for the stupid mistake. Wouldn't it be nicer if the kernel could simply detect references to uninitialized memory and scream loudly at the time?

The kmemcheck patch recently posted by Vegard Nossum offers just that functionality, though, perhaps, in a somewhat heavy-handed manner. A kernel with kmemcheck enabled is unlikely to be suitable for production use, but it should, indeed, do a good job at finding code using memory which has not yet been set to a useful value.

Kmemcheck is a relatively simple patch; the approach used is, essentially, this:

Every memory allocation is trapped at the page-allocator level. For each allocation, the requested order is increased by one, doubling the size of the allocation. The additional ("shadow") pages are initialized to zero and kept hidden.
The allocated memory is returned to the caller, but with the "present" bit cleared in the page tables. As a result, every attempt to access that memory will cause a page fault.
Once the fault happens, kmemcheck (through some ugly, architecture-specific code) determines the exact address and size of the attempted access. If the access is a write, the corresponding bytes in the shadow page are set to 0xff and the operation is allowed to complete.
For read accesses, the corresponding shadow page bytes are tested; if any of them are zero, the code concludes that the read is trying to access uninitialized data. A stack traceback is printed to enable the developer to find the location where this access is happening.

As should be evident, running with kmemcheck enabled will have certain performance impacts. Taking a page fault on every access to slab memory just cannot be fast. Doubling the size of every allocation will impose costs of its own, including the cache effects of simply working with twice as much memory. But that is a cost which can be paid when the kernel is being run in a debugging mode.

Vegard has posted some sample output which shows how the system responds to reads from uninitialized memory. If this output is to be believed, access to unset memory is not an especially uncommon occurrence in current kernels. If some of the references flagged here, once tracked down, turn out to be real bugs, the kmemcheck patch will have earned its keep, even if it never finds its way into the mainline.

System call updates: indirect(), timerfd(), and hijack()

By Jonathan Corbet
November 28, 2007

Last week's discussion of the proposed indirect() system call ended with some complaints from developers on the ugliness of the interface. Since then there has been some talk about system call interfaces in general, but not a whole lot of ideas for how indirect() could be done better.

The leading alternative would be that pushed by H. Peter Anvin: rather than use indirect() to extend a system call, simply make a new system call with the desired additional parameters. Then, usually, the old implementation can be replaced with a simple stub which calls the new version with the default values for the new parameters. It is a simple approach which easily maintains binary compatibility with very little runtime cost. Since there is no particular shortage of system call numbers, this is a process which could go on for a long time.

The management of increasing numbers of system calls does impose a cost, though; each one of those system calls is a user-space API which cannot ever be broken. The indirect() approach, instead, does not add more system calls. As long as the addition of parameters (with default values of zero) is done with care, avoiding API problems should be relatively easy to do.

There are also limits on how many parameters can be easily passed to system calls; on most systems, that limit is around six. Any system call requiring more arguments must already do uncomfortable things with indirect blocks. Creating new system calls with additional parameters will create more cases where this sort of indirect parameter handling is required. So the approach used by indirect() will find itself being used, in some form, anyway.

The key argument, though, still appears to be the syslet/threadlet mechanism. The ability to make any system call asynchronous has a lot of appeal, but doing so requires some additional information - a place to store the result of the call, if nothing else. Asynchronous system calls, in Linux, are, for all practical purposes, a type of indirect call. The proposed indirect() interface looks like it should be able to accommodate asynchronous calls nicely - though the precise API has not, yet, been nailed down.

As a result of all this, chances are that some form of indirect() will find its way into the mainline - though there is still time for somebody to come up with a better idea.

Meanwhile, the last time timerfd() was discussed here, it had been disabled in the 2.6.23 kernel as a result of complaints about its interface. Since then, little has happened with timerfd(), with the result that it will almost certainly not be present in 2.6.24 either. Some work has been done with this system call, though, and a new API proposal has been posted. This version has three system calls, the first of which is timerfd_create():

    int timerfd_create(int clockid, int flags);

The clockid argument tells the system which clock should be used: CLOCK_MONOTONIC or CLOCK_REALTIME. The flags argument is a recent addition; it is currently unused and must be zero. It was added on the assumption that somebody, somewhere, will always want some sort of behavior modification and one might as well avoid the need for an indirect version while it's easy. The return value from timerfd_create() is a file descriptor which can be passed to read() or any of the poll() variants. But, first, the timer should probably be programmed with:

    int timerfd_settime(int fd,
                        int flags,
                        const struct itimerspec *timer,
                        struct itimerspec *old_timer);

Here, fd is a file descriptor obtained from timerfd_create(), flags contains TFD_TIMER_ABSTIME if the timer is being set to an absolute time, and timer is the expiration time for the timer. If old_timer is not NULL, the location pointed to will be set to the previous value of the timer.

It is also possible to query the value of the timer with:

    int timerfd_gettime(int fd, struct itimerspec *timer);

The value returned in *timer will be the current setting of the timer associated with fd.

There have not been many comments on this version of the API, so something very similar to it will probably be merged. It would normally be considered too late to put a change like this into 2.6.24, but the 2.6.24-rc3-mm2 patch log says "Probably 2.6.24?". So one never knows. If this change is not merged soon, it will almost certainly become available for 2.6.25.

Finally, the hijack() system call continues to be developed on relatively quiet kernel subsystem lists. This call (described here in October) behaves much like clone() in that it creates a new process. Unlike clone(), however, hijack() causes the new process to share resources with a specified third process rather than with the parent. Its main reason for existence is to make it easy to enter different namespaces.

The hijack() interface remains almost unchanged:

    int hijack(unsigned long clone_flags, int which, int id);

The specified id value is interpreted according to which, which now has three possible values:

HIJACK_PID says that id is a process ID; the newly-created process will share resources (including namespaces) with the indicated process.
HIJACK_CG says that id is an open file descriptor for the tasks file in a target control group. In this case, the kernel will find a process within that control group and use it as the source for resources and namespaces.
HIJACK_NS is the newest option; like HIJACK_CG, it says that id is an open file descriptor indicating a control group. In this case, though, only the control group itself and any associated namespaces will be inherited by the new process. This version is intended for use when entry into an empty control group (where there are no processes to inherit from) is desired.

This new system call still has not seen any exposure on linux-kernel; it may well not survive its first experience there in its current form. If nothing else, a name change (to something which is more descriptive of the real function and, preferably, which does not put users onto intelligence agency watch lists) may well be called for. But a full container implementation on Linux will clearly need some sort of enter_container() system call at some point.

Patches and updates

Andrew Morton 2.6.24-rc3-mm1

Andrew Morton 2.6.24-rc3-mm2

Greg Kroah-Hartman Linux 2.6.23.9

Steven Rostedt 2.6.23.9-rt12

Greg Kroah-Hartman Linux 2.6.22.14

Roland McGrath ptrace: arch_has_single_step

Christoph Lameter Per cpu code simplification

Paul Mundt nommu: Add new vmalloc_user() and remap_vmalloc_range() interfaces.

Geert Uytterhoeven PS3 notification device patches for 2.6.25

Ulrich Drepper sys_indirect system call

Davide Libenzi Timerfd v2 - new timerfd API

Davide Libenzi Timerfd v3 - new timerfd API

Steven Rostedt New RT Balancing version 4

Andi Kleen [1/9] Core module symbol namespaces code and intro.

Srivatsa Vaddagiri sched: group scheduler related patches (V3)

Mathieu Desnoyers Linux Kernel Markers - Support Multiple Probes

Vegard Nossum kmemcheck: trap uses of uninitialized memory (v2)

ian Support for Toshiba TMIO multifunction devices

Haavard Skinnemoen dmaengine: Slave DMA interface and example users

Anton Vorontsov OF-platform PATA driver

Konrad Rzeszutek Add iSCSI IBFT Support (v0.3)

Love, Robert W Open-FCoE - Fibre Channel over Ethernet Project

Pavel Emelyanov [PATCH (resend)][DOCUMENTATION] The namespaces compatibility list doc

Michael Kerrisk man-pages-2.68 is released

Daniel Drake Documentation about unaligned memory access

Greg KH New kobject/kset/ktype documentation and example code

Mel Gorman Use two zonelists per node instead of multiple zonelists v10

Templin, Fred L ipv6: RFC4214 Support (v2.4)

Ryousei Takano NET_SCHED: PSPacer qdisc module

Phil Oester Per-conntrack timeout target v3

Hideo AOKI UDP memory accounting and limitation (take 9)

Tetsuo Handa Add packet filtering based on process's security context.

Serge E. Hallyn [PATCH 1/1] capabilities: introduce per-process capability bounding set (v8)

Pavel Emelyanov Sysctl shadow management

KAMEZAWA Hiroyuki [PATCH][for -mm] per-zone and reclaim enhancements for memory controller take 3 [0/10] introduction

Mark Nelson namespaces: introduce sys_hijack (v10)

Zoltan Sogor Add LZO compression support to cryptoapi
