Kernel development
Brief items
Kernel release status
The current stable 2.6 release is 2.6.13.4, released on October 10. It contains a small number of security-related fixes, a fix for the elusive Sparc FPU bug, and a few other patches.
The current 2.6 prepatch is 2.6.14-rc4, announced by Linus on October 10. This will be, he says, the last -rc release before 2.6.14 comes out. It contains mostly fixes, but there are also some driver updates, a new Megaraid SAS driver, and a new gfp_t type which has caused a prototype change for many internal functions which perform memory allocations (see below). The details may be found in the long-format changelog.
There have been no -mm releases since 2.6.14-rc2-mm2 came out on September 29.
Kernel development news
Quote of the week
Two new web sites
Those of you who were watching in the early days of Linux kernel development will remember a series of web sites which consisted of a list of kernel releases and the changes to be found in each. Maintaining such a site is a considerable amount of work, however, and no such site has been operating for some time now. That has just changed with Diego Calleja's announcement of his LinuxChanges page, hosted on the KernelNewbies site. The entries go all the way back to 2.5.1 (released almost four years ago) and provide a list of relevant changes for each release. It is a useful site which, one hopes, will be kept current for a long time to come.
For those who are interested in the many projects underway in the networking subsystem, a visit to the new linux-net wiki may be in order. Visitors cannot help being struck by the amount of work which is going on in this area.
Introducing gfp_t
Most kernel functions which deal with memory allocation take a set of "GFP flags" as an argument. These flags describe the allocation and how it should be satisfied; among other things, they control whether it is possible to sleep while waiting for memory, whether high memory can be used, and whether it is possible to call into the filesystem code. The flags are a simple integer value, and that leads to a potential problem: coding errors could result in functions being called with incorrect arguments. An occasional error has turned up where function arguments have gotten confused (usually through ordering mistakes). The resulting bugs can be strange and hard to track down.
A while back, the __nocast attribute was added to catch these mistakes. This attribute simply says that automatic type coercion should not be applied; it is used by the sparse utility. A more complete solution is now on the way, in the form of a new gfp_t type. The patch defining this type, and changing several kernel interfaces, was posted by Al Viro and merged just before 2.6.14-rc4 came out. There are several more patches in the series, but they have evidently been put on hold for now.
The patches are surprisingly large and intrusive; it turns out that quite a few kernel functions accept GFP flags as arguments. For all that, the generated object code does not change at all, and the source, as seen by gcc, changes very little. Once the patch set is complete, however, it will allow comprehensive type checking of GFP flag arguments, catching a whole class of potential bugs before they bite anybody.
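As a rough illustration of how this works (the annotation and prototype below are a sketch, not copied verbatim from the patch), gfp_t amounts to an integer type which sparse will not silently convert:

typedef unsigned int __nocast gfp_t;

/* Allocation functions then declare their flags argument as gfp_t: */
void *kmalloc(size_t size, gfp_t flags);

/*
 * With that in place, an argument-ordering mistake such as
 *
 *     buf = kmalloc(GFP_KERNEL, sizeof(*buf));
 *
 * draws a warning from sparse, where plain integer arguments would
 * simply be accepted.
 */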
Hard drive protection
One of the many features which will be shipped with the 2.6.14 kernel is a driver for the "hard drive active protection system" found in some ThinkPad laptops. This system provides a set of sensors, and, in particular, an accelerometer which can report on the position of the laptop - and how quickly that position is changing. There are a number of applications for such a device - such as a version of neverball played by tipping the laptop. The real purpose, however, is to enable the system to react to a fall and attempt to protect the hard drive.
The next step in the implementation of that purpose is the hard drive protection patch recently posted by Jon Escombe. This patch adds two new callbacks to the block request queue which drivers can provide:
typedef int (issue_protect_fn) (request_queue_t *);
typedef int (issue_unprotect_fn) (request_queue_t *);
If the driver provides these functions, the request queue, as seen in sysfs, will contain a new protect attribute. If a value is written to that attribute, the block system will interpret it as an integer number of seconds. The issue_protect_fn() will be called, and the request queue will be plugged for the indicated number of seconds. When that time expires, issue_unprotect_fn() will be called and the queue will be restarted.
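A driver wishing to support this mechanism would, presumably, look something like the following sketch; the registration helpers and device-level functions shown here are assumed names, not interfaces taken from the patch.

/* Sketch of a low-level driver hooking into the protect mechanism.
 * blk_queue_issue_protect_fn(), blk_queue_issue_unprotect_fn() and the
 * mydrv_*() helpers are assumptions, not names from the posted patch. */
static int mydrv_issue_protect(request_queue_t *q)
{
	struct mydrv_device *dev = q->queuedata;

	return mydrv_park_heads(dev);		/* tell the drive to park */
}

static int mydrv_issue_unprotect(request_queue_t *q)
{
	struct mydrv_device *dev = q->queuedata;

	return mydrv_unpark_heads(dev);		/* resume normal operation */
}

/* At queue setup time: */
blk_queue_issue_protect_fn(q, mydrv_issue_protect);
blk_queue_issue_unprotect_fn(q, mydrv_issue_unprotect);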
The theory of operation here is that a user-space daemon will be monitoring the status of the system, as reported by the accelerometer. Should this daemon note that the laptop has begun to accelerate, it will quickly write a value to the protect attribute for each drive in the system. The drives will respond by parking the disk heads, and, in any other possible way, telling the drive to crawl into its shell and prepare for impact. Once the event has transpired, the shattered remains of the laptop can attempt to resume normal operation.
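When the time comes, the daemon's "brace for impact" step amounts to a single sysfs write. A minimal sketch follows; the sysfs path is an assumption based on the attribute described above, not a path taken from the patch.

#include <fcntl.h>
#include <unistd.h>

/* Sketch: ask the block layer to protect this drive for five seconds.
 * The sysfs path is assumed from the attribute name described above. */
static void brace_for_impact(void)
{
	int fd = open("/sys/block/hda/queue/protect", O_WRONLY);

	if (fd >= 0) {
		write(fd, "5", 1);
		close(fd);
	}
}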
The idea seems reasonable, but block maintainer Jens Axboe has turned down the patch for now, objecting to the addition of yet more request queue callbacks.
The number of request queue callbacks is indeed large. Some of them have little to do with drivers (there's one which is called whenever disk activity happens, for example; it can be used to flash a keyboard LED in the absence of a hardware disk activity light), but others, such as the ones discussed here, are direct requests to the underlying block driver. The use of callbacks seems a little redundant in this situation, given that the request queue is, fundamentally, a mechanism for conveying commands to block drivers. The right solution might thus be to use the request queue to carry commands beyond those requesting the movement of blocks to and from the drive.
To an extent, the request queue is already used this way. Packet commands, ATA task file commands, and power management commands can be fed to drivers through the queue. In each case, the flags field of struct request is used to indicate that something special is being requested. The use of flags in this way is getting a little unwieldy, however, leading to the consideration of a new approach.
That approach, as seen in a patch held by Jens, is to add a new field (cmd_type) to struct request which indicates the type of command embodied by each request. Currently-anticipated types include packet commands, sense requests, power management commands, flush requests, driver-specific special requests, and Linux-specific, generic requests. Oh, and the occasional request to move a disk block in one direction or the other. The addition of cmd_type turns struct request into a generic carrier of commands to a disk drive.
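The result might look roughly like the following enumeration; the specific names are a sketch based on the list above, not the actual definitions from Jens's patch.

/* Illustrative sketch of the command types described above. */
enum rq_cmd_type {
	REQ_TYPE_FS,		/* ordinary read/write of disk blocks */
	REQ_TYPE_BLOCK_PC,	/* packet command */
	REQ_TYPE_SENSE,		/* sense request */
	REQ_TYPE_PM,		/* power management command */
	REQ_TYPE_FLUSH,		/* flush request */
	REQ_TYPE_SPECIAL,	/* driver-specific special request */
	REQ_TYPE_LINUX_BLOCK,	/* Linux-specific, generic request */
};

/* struct request then gains an "enum rq_cmd_type cmd_type" field. */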
With this mechanism in place, the "brace yourself, we're falling!" message becomes just another Linux-specific block request type. When such an event happens, the kernel need only place one of those messages on the queue - preferably at the head of the queue - and call the driver's request() function. The driver can then prepare the drive for the coming catastrophe and plug the queue itself. No additional callbacks required.
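Queuing one of those messages might then look something like the sketch below; the command type and the use of blk_insert_request() to put it at the head of the queue are assumptions about how such an interface could be used, not code from a posted patch.

/* Sketch: queue a "prepare for impact" message ahead of all other I/O.
 * REQ_TYPE_LINUX_BLOCK and the command details are assumptions. */
struct request *rq = blk_get_request(q, READ, GFP_ATOMIC);

if (rq) {
	rq->cmd_type = REQ_TYPE_LINUX_BLOCK;
	/* ... fill in the protect command and its timeout ... */
	blk_insert_request(q, rq, 1 /* at head */, NULL);
}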
This approach does involve some significant changes to the block layer, however, and would include a driver API change. So it is not likely to take a quick path into the kernel. The hard drive protection mechanism, which will require the new API, thus looks likely to wait in line for a while yet.
Adaptive file readahead
Readahead is a technique employed by the kernel in an attempt to improve file reading performance. If the kernel has reason to believe that a particular file is being read sequentially, it will attempt to read blocks from the file into memory before the application requests them. When readahead works, it speeds up the system's throughput, since the reading application does not have to wait for its requests. When readahead fails, however, it generates useless I/O and occupies memory pages which are needed for some other purpose.
The current kernel readahead implementation uses a 128KB window. When readahead seems appropriate, the kernel will speculatively bring in the next 128KB of file data. If the application continues to read sequentially through that data, the next 128KB chunk will be brought in when the application is part-way through the first one. This implementation works, but Wu Fengguang thinks that it can be made better.
In particular, Wu thinks that the fixed readahead window size should, instead, adapt to both the application's behavior and the global state of the system. His adaptive readahead patch is an implementation of this thought. It is a work of daunting complexity, but the core ideas are reasonably straightforward.
The adaptive readahead patch tries to balance two constraints: readahead should be performed aggressively, but not to the point that the system starts thrashing or readahead pages get recycled before the application uses them. Every time a readahead decision is to be made for a specific file, the adaptive code looks at how much memory is available for readahead and how quickly the application has been working through the file. If memory is tight, or if the disk holding the file is congested, readahead will not be performed at all.
The code also looks at the pressure on the inactive page lists and tries to figure out whether any readahead pages are in danger of falling off that list and being reclaimed. In that situation, the readahead pages will be moved back up the list, keeping them in memory for a bit longer. This "rescue" operation helps to keep previous readahead work from being wasted; since it is only performed when the application consumes data from the file, it will not happen if the reading process has stalled entirely. But, when the application is working through the data, it will get another chance to benefit from readahead which has already been performed. No more readahead will be started in that situation, however.
If, instead, the application is making use of its readahead pages and the memory is available, the readahead window can grow up to 1MB. For streaming media or data processing applications which work their way sequentially through large files, this enlarged window can lead to significant performance gains.
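In greatly simplified pseudocode, the sizing decision described above comes down to something like this; every name here is illustrative, and the actual patch considers rather more factors.

/* Purely illustrative sketch of the adaptive readahead sizing decision. */
static unsigned long adaptive_ra_size(unsigned long current_window,
				      int memory_tight, int disk_congested)
{
	if (memory_tight || disk_congested)
		return 0;			/* skip readahead entirely */

	/* Otherwise let the window grow, up to the 1MB ceiling. */
	if (current_window * 2 >= 1024 * 1024)
		return 1024 * 1024;
	return current_window * 2;
}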
In fact, Wu claims results which are "pretty optimistic." They include a 20-100% improvement for applications doing parallel reads, and the ability to run 800 1KB/sec simultaneous streams on a 64MB system without thrashing. The page cache hit rate is claimed to be 91%, which is quite good.
The adaptive readahead patch might, thus, be a worthwhile addition to the Linux memory management subsystem. There has been little discussion (none, actually) of the patch on the list, however. Complicated patches working in an obscure corner of memory management do not receive the same level of review as, say, new filesystems, it would seem. In any case, a patch of this nature will require a good deal of testing before it can be considered for any sort of merge. So, while adaptive readahead may indeed make its way into the mainline, it's not something to expect to see in the very near future.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet