Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.19-rc5, released by Linus on November 7. It contains another pile of fixes, many of them in architecture-specific code; the long-format changelog has the details. Linus says "there may be a -rc6, but maybe we don't even need one."

Adrian Bunk calls those "famous last words" in his 2.6.19-rc5 known regressions list.

The current -mm tree is 2.6.19-rc5-mm1. Recent changes to -mm include the latest kevent code (see below), the kernel virtual machine patch set, and some big updates to the high-resolution timer and dynamic tick code - which still has some problems.

The current stable 2.6 release is 2.6.18.2, released on November 3. Once again, quite a long list of patches has been merged into this release.

On the 2.6.16 front, 2.6.16.30 was released on November 3, followed by 2.6.16.31 on November 7. Between these two releases quite a few bugs have been fixed, including several which are security-related.

For 2.4 users, 2.4.34-pre5 came out on November 4. The first 2.4.34 release candidate is expected before too long.

Kernel development news

OSDL to fund a kernel tech writer

It took a long time to come about, but it has happened: OSDL has pulled together the money to fund a technical writer to work on kernel documentation for a year. The job posting is available on the net for anybody who might be interested in applying.

Task watchers

One of the more complicated core kernel functions is copy_process(), in kernel/fork.c. This routine is the heart of the fork() and clone() system calls; it must create a coherent copy of a running process, bearing in mind the various clone flags which are present. There are sixteen different goto labels for error exits. This is clearly a place where a lot of things can go wrong.

It is also an operation of interest to many other kernel subsystems. A look at copy_process() reveals hooks for task delay accounting, auditing, the process fork connector, SYSV semaphore undo information management, NUMA memory policy enforcement, cpuset maintenance, keyring management, and more. Many of these subsystems want to know about other events in the process lifecycle as well, with the result that hooks are placed all over the process code. It might just be nice to have a cleaner solution to the problem of learning about process-related events.

That cleaner solution would appear to be present in the form of Matt Helsley's task watchers patch set, currently in its second major iteration. This patch takes an interesting approach to providing what is essentially just another notifier interface in order to minimize overhead in a performance-critical part of the kernel.

In this patch, a "task watcher" is a function which is notified whenever an interesting process event takes place. Watchers have this prototype:

    int my_watcher(unsigned long info, struct task_struct *tsk);

When the watcher function is called, info will have additional information for the specific event, and tsk points to the process generating the event. Arranging for a task watcher to be called is a simple matter of adding a declaration like the following:

    task_watcher_func(event, function);

Where event is the event of interest, and function is the task watcher function to be called in response to that event. The possible events are:

init: a process is first created; info is the set of flags passed to clone().
clone: a process forks; info is the set of clone() flags. Note that this watcher appears to be called with the child process; it differs from init in that it is called toward the end of copy_process(), when creation of the new process is complete.
exec: a process executes a new program; info is zero.
uid: a process changes its real or effective UID; info is zero.
gid: a process changes its real or effective GID; info is zero.
exit: a process dies; info is the exit code.
free: a process's task structure is being freed; info is the exit code.

The task_watcher_func() macro creates a pointer to the watcher function in a special ELF section. There is a separate section for each watched-for event; when such an event is signaled, the watcher code simply iterates through each function found in the relevant executable section. There are a couple of implications resulting from this mechanism: task watchers exist for the life of the system (they cannot be registered and unregistered), and they cannot be located in loadable modules (though this restriction will eventually go away).
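
The section-based dispatch described above can be imitated in user space. What follows is a minimal sketch, not the code from the patch: the WATCH_EXIT() macro, the "watch_exit" section name, and the notify_exit() function are all made up for illustration, and the watcher takes a plain pid where the kernel version takes a struct task_struct pointer. It assumes GCC (or clang) with GNU ld on an ELF system, where the linker provides __start_ and __stop_ symbols for any section whose name is a valid C identifier.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical user-space analogue of a task watcher; the kernel
 * version takes a struct task_struct * instead of a pid. */
typedef int (*watcher_fn)(unsigned long info, int pid);

/* Assumed macro and section names: store a pointer to fn in the
 * "watch_exit" section. "used" keeps the otherwise-unreferenced
 * pointer from being discarded by the compiler. */
#define WATCH_EXIT(fn) \
    static watcher_fn const watch_ptr_##fn \
        __attribute__((used, section("watch_exit"))) = fn

/* Provided by GNU ld: the bounds of the "watch_exit" section. */
extern watcher_fn const __start_watch_exit[];
extern watcher_fn const __stop_watch_exit[];

unsigned long last_exit_code;   /* written by record_exit() */

static int record_exit(unsigned long info, int pid)
{
    (void)pid;
    last_exit_code = info;
    return 0;
}

static int log_exit(unsigned long info, int pid)
{
    printf("pid %d exited with code %lu\n", pid, info);
    return 0;
}

/* Registration happens at build time; there is no list to lock. */
WATCH_EXIT(record_exit);
WATCH_EXIT(log_exit);

/* Signal the event: walk every pointer the linker gathered into the
 * section and call it. Returns the number of watchers called. */
int notify_exit(unsigned long info, int pid)
{
    int called = 0;

    for (const watcher_fn *w = __start_watch_exit;
         w < __stop_watch_exit; w++) {
        (*w)(info, pid);
        called++;
    }
    return called;
}
```

A call like notify_exit(3, 42) runs both watchers. Since the table is assembled by the linker, nothing can be added or removed at run time, which mirrors the restrictions noted above.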

One might well wonder why things were done this way, rather than using a simple notifier list. Your editor wondered, and asked Mr. Helsley about it. The problem is that process creation is a performance-critical part of the kernel, and any change which increases process fork time tends to get a lot of scrutiny. Fork times are measured by a number of benchmarks; quick process creation is also important in fork-heavy loads. Since kernel compilation can require a lot of forks, there is an especially strong incentive to keep it fast.

If a notifier list is used with watchers, some sort of locking is required to keep that list from being corrupted when watchers come and go. The separate ELF sections, instead, are read-only structures created at kernel build time. So they impose less overhead on the process lifecycle and, thus, are less likely to bother kernel developers who, perhaps, are not really interested in the watcher functionality.

This week's version of the kevent interface

The proposed kevent interface was last covered here in August. This new API, which seeks to provide a single interface for applications to receive events of interest, has been under development for the better part of a year now. It continues to evolve, so, in celebration of the version 23 kevent patch, another look is called for.

Parts of the interface remain relatively stable. So, the main multiplexer system call remains:

    int kevent_ctl(int fd, unsigned int cmd, unsigned int num,
                   struct ukevent *arg);

The functions performed by this call are reduced in number, however. It is no longer used to create the kevent file descriptor in the first place; instead, an open of /dev/kevent is called for. But kevent_ctl() is still the place to add events of interest, and to remove and modify them.

The synchronous interface for waiting for events is also pretty much as it has been for a little while:

    int kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr,
                          __u64 timeout, struct ukevent *buf,
                          unsigned flags);

This system call will wait until at least min_nr events are ready for consumption, then copy up to max_nr completed events into buf. The call will return early, however, if timeout nanoseconds pass before min_nr events are signaled. The current documentation for kevents says that an indefinite wait can be had by passing -1 for timeout - slightly strange, given that timeout is an unsigned quantity. It would not be surprising to see some sort of KEVENT_WAIT_FOREVER value defined for that purpose instead.
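
To see why -1 works at all for an unsigned timeout, consider this small sketch. KEVENT_WAIT_FOREVER is the hypothetical name floated above, not something the patch is known to define, and both helper functions are invented for illustration; uint64_t stands in for the kernel's __u64.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant: the text only speculates that something
 * like this might be defined some day. */
#define KEVENT_WAIT_FOREVER UINT64_MAX

/* Converting -1 to an unsigned 64-bit type wraps to the largest
 * representable value, so "-1" and "all bits set" are the same. */
static_assert((uint64_t)-1 == UINT64_MAX, "-1 wraps to the maximum");

/* Illustrative helper: build a relative timeout in nanoseconds. */
uint64_t seconds_to_timeout(unsigned int secs)
{
    return (uint64_t)secs * 1000000000ULL;
}

/* Illustrative helper: does this value mean "wait indefinitely"? */
int timeout_is_forever(uint64_t timeout_ns)
{
    return timeout_ns == KEVENT_WAIT_FOREVER;
}
```

Taken literally as a nanosecond count, the wrapped value is a bit more than 584 years, so a named constant would mostly serve to make the intent visible in calling code.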

The biggest changes can be found in the kevent ring buffer code which, last time we looked, was rather awkward to use. The previous implementation also placed the ring buffer in nailed-down kernel memory, potentially opening the system up to denial of service problems. So, in the new implementation, the ring buffer is kept entirely in user space. The application simply allocates an array of the desired size with the following type:

    struct kevent_ring
    {
        unsigned int        ring_kidx;
        struct ukevent      event[0];
    };

The actual number of events to be stored in the ring is determined by the application. The kevent subsystem must be told about this ring with:

    int kevent_ring_init(int fd, struct kevent_ring *ring, 
                         unsigned int num);

where num is the number of ukevent structures in the ring. This call will remember the ring's address and size, and set ring_kidx - the index of the entry where the kernel will store the next completed event - to zero.
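
Sizing the allocation might look like the sketch below. The struct ukevent shown here is a stand-in with invented fields, since its real layout is not covered in this article, and kevent_ring_bytes() and kevent_ring_alloc() are illustrative helper names rather than part of the kevent API. The sketch uses the C99 flexible array spelling, event[], in place of the GNU-style event[0].

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the real struct ukevent; these fields are invented. */
struct ukevent {
    unsigned int id;
    unsigned int ret_flags;
};

struct kevent_ring {
    unsigned int   ring_kidx;
    struct ukevent event[];     /* C99 form of event[0] */
};

/* Bytes needed for a ring holding num events: the header plus
 * num trailing ukevent slots. */
size_t kevent_ring_bytes(unsigned int num)
{
    return sizeof(struct kevent_ring) + (size_t)num * sizeof(struct ukevent);
}

/* Allocate a zeroed ring; returns NULL on failure. The result would
 * then be handed to kevent_ring_init(fd, ring, num). */
struct kevent_ring *kevent_ring_alloc(unsigned int num)
{
    assert(num > 0);
    return calloc(1, kevent_ring_bytes(num));
}
```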

There are a few things to be aware of when working with the kevent ring. One is that there is no place in this data structure to track which event the application should consume next; the application must store that index elsewhere. There also appears to be no way to disconnect or resize the ring buffer without simply closing the event file descriptor and starting over; an attempt to replace one ring with another will fail. Finally, the application must tell the kernel to put events into the ring with:

    int kevent_wait(int fd, unsigned int num, __u64 timeout);

This system call will wait until at least one event is available, then copy up to num events into the ring buffer. Once the events are copied, the kernel considers them to be consumed and will forget about them (or requeue them if the event so requests). The application can work through the events at leisure - stopping before hitting the current ring_kidx value - with no further system calls required.
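
The consumer side can be sketched as follows. Everything here beyond struct kevent_ring and ring_kidx is an assumption made for illustration: the struct ukevent fields are invented, struct ring_reader is simply one way for the application to keep its own read index, and mock_kernel_store() stands in for the kernel so that the loop can be exercised without the patch. The sketch also assumes that ring_kidx wraps back to zero after the last slot.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for the real struct ukevent; these fields are invented. */
struct ukevent {
    unsigned int id;
    unsigned int ret_flags;
};

struct kevent_ring {
    unsigned int   ring_kidx;   /* next slot the kernel will fill */
    struct ukevent event[];
};

/* Application-side state: the ring has no field for the consumer's
 * position, so it is tracked here. */
struct ring_reader {
    struct kevent_ring *ring;
    unsigned int num;           /* number of slots in the ring */
    unsigned int uidx;          /* next slot the application will read */
};

/* Mock of the kernel side, for demonstration only: store one event
 * and advance ring_kidx, wrapping at the end (an assumption). */
void mock_kernel_store(struct kevent_ring *ring, unsigned int num,
                       unsigned int id)
{
    ring->event[ring->ring_kidx].id = id;
    ring->ring_kidx = (ring->ring_kidx + 1) % num;
}

/* Copy the ids of pending events into ids_out, stopping at ring_kidx
 * or after max entries. Returns the number of events consumed. */
unsigned int ring_drain(struct ring_reader *rd, unsigned int *ids_out,
                        unsigned int max)
{
    unsigned int n = 0;

    assert(rd->uidx < rd->num);
    while (n < max && rd->uidx != rd->ring->ring_kidx) {
        ids_out[n++] = rd->ring->event[rd->uidx].id;
        rd->uidx = (rd->uidx + 1) % rd->num;
    }
    return n;
}
```

With only two indexes, a completely full ring looks the same as an empty one. In this sketch the application would avoid the ambiguity by never asking kevent_wait() for more than num - 1 events at a time; how the real code handles overflow is not covered here.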

The current API seems to have made most of the people who are paying attention happy - though it has been a little while since Ulrich Drepper, an important player, has chimed in. In the past, he has been unhappy about the timeout parameter (preferring that the interface use an absolute timespec value rather than a relative value). Ulrich has also suggested that the blocking system calls could use a version which specifies a signal mask, much like the recently added ppoll() and pselect() system calls. He points out that, while it is possible to receive signals as kevents, some applications will certainly still use traditional signals, with their traditional atomicity problems.
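
For readers unfamiliar with the pattern, here is a sketch of what the signal mask argument buys, using the existing ppoll() call rather than anything from the kevent patch. The helper names prepare_signal() and wait_for_signal() are invented. The signal stays blocked except while ppoll() is actually waiting, so there is no window in which it can arrive after the application has checked for it but before the wait begins.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <stddef.h>
#include <time.h>

static volatile sig_atomic_t got_signal;

static void on_signal(int sig)
{
    (void)sig;
    got_signal = 1;
}

/* Install a handler for sig and block it, so that it can only be
 * delivered inside wait_for_signal(). Returns 0 on success. */
int prepare_signal(int sig)
{
    struct sigaction sa;
    sigset_t block;

    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(sig, &sa, NULL) < 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, sig);
    return sigprocmask(SIG_BLOCK, &block, NULL);
}

/* Wait up to timeout_ms for sig. ppoll() installs the mask and starts
 * waiting as one atomic step, then restores the old mask on return.
 * Returns 1 if the signal arrived, 0 on timeout, -1 on error. */
int wait_for_signal(int sig, long timeout_ms)
{
    sigset_t during;
    struct timespec ts;
    int rc;

    /* Start from the current mask and unblock only sig. */
    if (sigprocmask(SIG_SETMASK, NULL, &during) < 0)
        return -1;
    sigdelset(&during, sig);

    ts.tv_sec = timeout_ms / 1000;
    ts.tv_nsec = (timeout_ms % 1000) * 1000000L;

    got_signal = 0;
    rc = ppoll(NULL, 0, &ts, &during);
    if (rc < 0 && errno == EINTR && got_signal)
        return 1;
    return rc == 0 ? 0 : -1;
}
```

After prepare_signal(SIGUSR1), a signal raised before the wait simply stays pending and is picked up by the next wait_for_signal() call. A kevent variant along the lines Ulrich suggests would presumably offer the same guarantee to the blocking kevent calls.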

So there may be a few remaining issues to take care of before the kevent API is merged into the mainline kernel - and consequently set in stone. But there is apparent progress in that direction, and the number of developers showing interest in this API appears to be on the increase. It may not be too many more kernel cycles before Linux has a unified event interface of its very own.

Sparse gets a maintainer

The "sparse" utility has long been one of Linux's best-kept secrets. It is a static analysis tool which can find a wide variety of bugs in the kernel code base; sparse is a useful tool, but it can be surprisingly hard to find. It has never had a web page, and almost no distributions package it. Interested users must, instead, track down the git tree or Dave Jones's snapshot directory.

Sparse was originally written by Linus Torvalds, but he has not done much with it for a while, and he recently suggested that somebody else should take it over:

Anyway, I suspect it would be better if people didn't consider me the maintainer for sparse, simply because it does the things I really cared about, and as a result I'm not really very active.

As a result of this discussion, sparse has a new maintainer: Josh Triplett. Josh started things off with sparse 0.1, the first-ever sparse release with a version number. He has set up a new git tree for sparse, and, even, a sparse web page.

Josh was kind enough to answer some questions posed by your editor. It turns out that he has been working with sparse for a while; it was part of his PhD work, where he enhanced it to be able to verify proper use of the read-copy-update (RCU) primitives. That work continued at IBM over the summer, where he was able to work on RCU verification with Paul McKenney.

As a result, his first priority for sparse in the near future is the continuation of the RCU work. This effort is also expanding into locking verification in general; some of the necessary annotations and resulting fixes have gone into the 2.6.18 and 2.6.19-rc kernels. Josh also plans to work on the elimination of false positives and on noise reduction in general. Then, there are various patches from other developers which have been floating around for a while and really need to be merged into the sparse mainline.

In terms of project management, Josh says:

I plan to continue making regular Sparse releases, and I'd like to get Sparse packaged in various distributions, at least in their "experimental" sections or equivalent. Any potential distribution packagers, feel free to join the linux-sparse list, and let me know what I can do to help or to get things going more smoothly.

Getting sparse into distributions could only help increase its use - and bring about a corresponding reduction in bugs in shipped code. This will be especially true if Josh succeeds in another one of his goals: expanding sparse usage beyond the kernel into user-space projects. X.org seems like it could be an early sparse adopter.

Longer-term, Josh wants to look at more advanced techniques which can look at larger chunks of a program and find potential bugs. Part of this effort will require attracting other researchers interested in static analysis to the sparse platform. Says Josh:

I feel that several classes of bugs exist in the Linux kernel and in userspace code which simply should not exist, because the tools exist to find and eliminate almost all of them. This includes bugs like "scheduling while atomic", __init-related bugs, errors on error paths, and many locking-related bugs.

One can only imagine that free software users all over are wishing Josh the best of luck in his effort to track down and get rid of all those unnecessary bugs.

Patches and updates

Kernel trees

Linus Torvalds Linux 2.6.19-rc5

Andrew Morton 2.6.19-rc5-mm1

Andrew Morton 2.6.19-rc4-mm2

Chris Wright Linux 2.6.18.2

Adrian Bunk Linux 2.6.16.31

Adrian Bunk Linux 2.6.16.31-rc1

Adrian Bunk Linux 2.6.16.30

Willy Tarreau Linux 2.4.34-pre5

Architecture-specific

Haavard Skinnemoen Atmel SPI driver and related AVR32 changes

Haavard Skinnemoen Atmel MACB ethernet driver for avr32

Core kernel code

Christoph Lameter Scheduler: Locking Optimization during load balance

Christoph Lameter sched_domain balancing via tasklet V3

Matt Helsley Task Watchers v2: Introduction

Evgeniy Polyakov kevent: Generic event handling mechanism.

Development tools

Junio C Hamano GIT 1.4.3.4

Josh Triplett ANNOUNCE: Sparse 0.1 - first release version of Sparse; new maintainer

Device drivers

Jonathan Corbet Marvell 88ALP01 and OmniVision OV7670 drivers

Douglas Gilbert RFC: SCSI Generic version 4 interface

Filesystems and block I/O

Mikulas Patocka New filesystem for Linux

NeilBrown md: udev notification, raid5 read improvements etc

Memory management

Christoph Lameter Avoid allocating during interleave from almost full nodes

Networking

John W. Linville New stuff in wireless-dev, wireless developers please pull...

David Miller net-2.6.20 is up...

Security-related

Serge E. Hallyn security: introduce file posix caps

Virtualization and containers

Avi Kivity kvm howto

Avi Kivity KVM: Kernel-based Virtual Machine (v4)

Serge E. Hallyn [PATCH 0/4] uid_ns: introduction

Miscellaneous

Pablo Neira Ayuso conntrackd-0.9.1 released

Page editor: Jonathan Corbet