Kernel development [LWN.net]

Kernel release status

The current 2.6 prepatch is 2.6.8-rc4, which was announced by Linus on August 9. There was, he says, just a little too much new stuff in there for him to have been comfortable putting it out directly as 2.6.8. That new stuff includes a reimplemented i586-optimized AES implementation, a new internal infrastructure for handling file positioning and seekability (see below), a sysctl API change, and some architecture updates. See the long-format changelog for the details.

Linus's BitKeeper tree contains a big Prism54 driver update and various fixes. Things are stabilizing for an official 2.6.8 release, which may have happened by the time you read this.

The current prepatch from Andrew Morton is 2.6.8-rc4-mm1. Recent additions to -mm include a mechanism for gathering CPU scheduler statistics, the "mlock as user" patch (covered briefly last week), some asynchronous I/O fixes, version 17 of the wireless extensions API, some read-copy-update enhancements, resident set size ulimit support (see below), in-kernel cryptographic keyring management, a number of architecture updates, and lots of fixes. The staircase scheduler has been dropped from -mm for now ("it used up its time slice") in favor of a simpler patch which simply disables the use of the expired array. The quest for the best way to improve the scheduler continues.

The current 2.4 kernel is 2.4.27, released by Marcelo on August 7. 2.4.27 contains fixes for a handful of security problems, some new crypto algorithms, a big serial ATA update, TCP Vegas and BIC backports from 2.6, and vast numbers of fixes.

Safe seeks

The lseek() system call allows user space to move the current read/write position within a file. It is not an operation which normally attracts attention, since its full effect is usually just to change an internal integer index. It turns out, however, that lseek() has been poorly implemented in many parts of the kernel. The recent vulnerability discovered by Paul Starzetz has highlighted the problem, with the result that the internal handling of lseek() is changing significantly for 2.6.8.

Seeking within a file is straightforward; it is just a matter of changing the current position index inside the kernel. The situation gets a little murkier, however, when dealing with things that are not regular files. Virtual files implemented by the kernel can often be seeked in a meaningful way, if it's done carefully; the same is true of a very small number of physical devices. For most devices, however, along with objects like network connections, seeking makes no sense at all.
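From user space, the "makes no sense at all" case is visible as an ESPIPE error. The following is a minimal sketch using only standard POSIX calls; the helper names try_rewind() and seek_on_pipe() are made up for this illustration. It uses a pipe, which is a non-seekable object that is easy to create anywhere.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Try to move fd back to offset 0.
 * Returns 0 on success, or the errno value reported by lseek(). */
int try_rewind(int fd)
{
    assert(fd >= 0);
    if (lseek(fd, 0, SEEK_SET) == (off_t)-1)
        return errno;
    return 0;
}

/* Create a pipe, attempt a seek on its read end, and report the result.
 * Returns the errno value from lseek(), or -1 if no pipe could be made. */
int seek_on_pipe(void)
{
    int fds[2];
    int err;

    if (pipe(fds) != 0)
        return -1;
    err = try_rewind(fds[0]);
    close(fds[0]);
    close(fds[1]);
    return err;
}
```

On a pipe, seek_on_pipe() should come back with ESPIPE; the same try_rewind() call on a regular file would normally return 0.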

The default behavior for lseek() is to change the internal offset pointer and return success; if the code for the underlying object (device, network connection, file, etc.) has not provided its own llseek() method, the call appears to succeed. Implementing a non-seekable device therefore requires an explicit action to ensure that user space is given the proper error. The traditional way of handling lseek() within a device driver is to include a simple llseek() method which looks like this:

    loff_t my_llseek(struct file *file, loff_t offset, int whence)
    {
        return -ESPIPE;    /* Not seekable */
    }

More recent kernels (2.4 and beyond) also provide a no_llseek() helper which does the same thing as the method above.

This technique works, as long as the author bothers to do things this way. In some cases, this little step gets skipped, and the resulting object appears seekable even though it is not. Even when this method is provided, however, it is not a complete solution: the pread() and pwrite() system calls, which supply an explicit offset for the operation, involve an implicit seek. Objects within the kernel do not see these calls directly; they just look like regular read() and write() calls. This works because the internal methods for these calls are always passed the offset to use.

What this means is that, for a non-seekable object, every read() or write() method should include a test like this:

    ssize_t my_read(struct file *filp, char *buf, size_t count,
                    loff_t *ppos)
    {
        /* ... */
        if (ppos != &filp->f_pos)
            return -ESPIPE;    /* Reached via pread(): not seekable */
        /* ... */
    }

This test works because, for normal read() and write() calls, the ppos pointer goes directly to the offset (f_pos) stored in the file structure. If ppos points elsewhere, it means that a pread() or pwrite() call has been made, and an error should be returned. These tests are simple, but they are bits of boilerplate code which must be added to the implementation of all non-seekable objects, and not all authors bother. After all, for most uses, the code works just fine without.
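The pointer comparison can be tried out in a small user-space model. This is only a sketch: struct file_model and the model_* functions are hypothetical stand-ins for the kernel's file structure and system call paths, not kernel code. It assumes, as described above, that read() hands the method a pointer to f_pos itself while pread() hands it a pointer to some other offset.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for the kernel's file structure. */
struct file_model {
    off_t f_pos;
};

/* read() method of a non-seekable object, using the pointer test. */
ssize_t model_dev_read(struct file_model *filp, char *buf, size_t count,
                       off_t *ppos)
{
    assert(filp != NULL && ppos != NULL);
    if (ppos != &filp->f_pos)
        return -ESPIPE;         /* offset came from somewhere else */
    for (size_t i = 0; i < count; i++)
        buf[i] = 'x';           /* pretend the device produced data */
    *ppos += (off_t)count;
    return (ssize_t)count;
}

/* Model of the read() path: the method sees f_pos directly. */
ssize_t model_sys_read(struct file_model *filp, char *buf, size_t count)
{
    return model_dev_read(filp, buf, count, &filp->f_pos);
}

/* Model of the pread() path: the method sees a caller-supplied offset. */
ssize_t model_sys_pread(struct file_model *filp, char *buf, size_t count,
                        off_t pos)
{
    return model_dev_read(filp, buf, count, &pos);
}
```

In this model, model_sys_read() succeeds and advances f_pos, while model_sys_pread() is rejected with -ESPIPE and leaves f_pos alone.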

The above code also forces widespread knowledge of the contents of the file structure and how position information is passed to read() and write() methods. For sysctl methods, things are even worse: there is no position passed in, so there is no alternative to getting it from the file structure.

Finally, there are some interesting race conditions associated with the handling of file offsets. Often a device driver will test a position for validity, sleep (while waiting for device operations or user-space copies), then change the offset. But that offset could have changed in other ways during the sleep, leaving its final value in an indeterminate state.

In response to all this, Linus has thrown together a set of patches changing the way seeks are handled inside the kernel. These patches have found their way into 2.6.8-rc4, but they were not posted separately on any open mailing lists first. The first patch adds a new FMODE_LSEEK bit to the file structure, so that the virtual filesystem (VFS) code knows which files are seekable and which are not. The idea is to move all tests for illegal seeks to the core VFS code. A second patch adds separate mode bits for pread() and pwrite(); as it turns out, files implemented with the seq_file interface are seekable, but do not support those two calls.

A pair of patches then followed to make use of the new tests in the VFS core. The nonseekable_open() helper was added to enable drivers (and other code) to clear the new bits and mark an object as not being seekable. It is meant to be called in the corresponding open() method. Then came changes to a large number of drivers making them use the new infrastructure; the net result was the removal of quite a bit of code.
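The division of labor between the open() method and the VFS core might be modeled as follows. This is a user-space sketch under stated assumptions: the MODEL_FMODE_* bit values are invented, the model_* functions are hypothetical, and it is assumed that a newly opened file starts with all three capabilities set (which would explain why unmodified drivers keep working).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Invented bit values; the kernel's real FMODE_* constants may differ. */
#define MODEL_FMODE_LSEEK  0x04u
#define MODEL_FMODE_PREAD  0x08u
#define MODEL_FMODE_PWRITE 0x10u

struct file_model {
    unsigned int f_mode;
    long long f_pos;
};

/* Assumed default at open time: everything is allowed. */
void model_open_defaults(struct file_model *filp)
{
    filp->f_mode = MODEL_FMODE_LSEEK | MODEL_FMODE_PREAD | MODEL_FMODE_PWRITE;
    filp->f_pos = 0;
}

/* Model of the helper a non-seekable driver calls from its open() method. */
int model_nonseekable_open(struct file_model *filp)
{
    assert(filp != NULL);
    filp->f_mode &= ~(MODEL_FMODE_LSEEK | MODEL_FMODE_PREAD |
                      MODEL_FMODE_PWRITE);
    return 0;
}

/* Model of a seq_file-style open: seekable, but no pread()/pwrite(). */
int model_seq_open(struct file_model *filp)
{
    filp->f_mode &= ~(MODEL_FMODE_PREAD | MODEL_FMODE_PWRITE);
    return 0;
}

/* Model of the central check: the driver is never consulted. */
long long model_vfs_lseek(struct file_model *filp, long long offset)
{
    if (!(filp->f_mode & MODEL_FMODE_LSEEK))
        return -ESPIPE;
    filp->f_pos = offset;
    return offset;
}

/* Nonzero if a pread() on this file would be let through. */
int model_pread_allowed(const struct file_model *filp)
{
    return (filp->f_mode & MODEL_FMODE_PREAD) != 0;
}
```

The point of the design is that the driver states once, at open time, what the object can do; every later seek or positional read is then refused in one place instead of in each driver's methods.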

It's worth noting that this patch represents a change in how device drivers should be written, but the actual API has not been changed in any incompatible ways. Unmodified drivers will still work - at least, as well as they did before. The sysctl change does involve an API change, however. All sysctl methods now have the offset passed in explicitly as a parameter; they should no longer go digging through the file structure for that information. Unmodified sysctl implementations will no longer compile.

The final step is to change how the read() and write() system calls are implemented. They now create a copy of the f_pos field and pass that to the appropriate methods, and copy the result back afterward. So those methods never work with f_pos directly, regardless of how they are invoked. As a result of all this work, the handling of seeking has become simpler and more robust.
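The copy-in, copy-out scheme for f_pos can be sketched in user space like this. The names are hypothetical (struct file_model, model_sys_read(), sample_read()); only the three-step pattern of copying the position, calling the method, and copying the result back is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical stand-in for the kernel's file structure. */
struct file_model {
    off_t f_pos;
};

typedef ssize_t (*read_method)(struct file_model *, char *, size_t, off_t *);

/* Model of the new read() path: the method only ever sees a copy. */
ssize_t model_sys_read(struct file_model *filp, read_method method,
                       char *buf, size_t count)
{
    off_t pos;
    ssize_t ret;

    assert(filp != NULL && method != NULL);
    pos = filp->f_pos;                  /* private copy of the position */
    ret = method(filp, buf, count, &pos);
    filp->f_pos = pos;                  /* copy the result back */
    return ret;
}

/* Set by sample_read(): was it handed a pointer to f_pos itself? */
int saw_f_pos_pointer = -1;

/* Sample method: fills the buffer and advances the offset it was given. */
ssize_t sample_read(struct file_model *filp, char *buf, size_t count,
                    off_t *ppos)
{
    saw_f_pos_pointer = (ppos == &filp->f_pos);
    memset(buf, 'x', count);
    *ppos += (off_t)count;
    return (ssize_t)count;
}
```

In this model a method can no longer tell read() from pread() by comparing pointers, since ppos never points at f_pos; that fits with the seekability tests living in the VFS core rather than in each driver.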

Simple resident set size limits

One of the problems which can afflict any virtual memory system is a process which expands to fill all of memory. All it takes is, say, a quick OpenOffice session, and everything else running on the system finds itself shoved into a corner of memory and pushed out onto swap. Avoiding this problem is a simple matter of limiting the amount of physical memory that any given process can occupy, but Linux lacks such limits.

Rik van Riel seems to have started off on a series of relatively simple patches which address immediate VM issues. His latest patch implements resident set size limits for Linux processes. Once this patch is applied, a bit of appropriate limit setting could do a lot to keep those memory hog processes in their place.
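As for the limit setting itself, a process could lower its own soft limit along these lines. This sketch assumes that the patch takes its value from the existing RLIMIT_RSS resource limit, which the text here does not spell out; the helper names are made up. On a kernel that does not enforce RSS limits, the calls may well succeed without having any effect.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/resource.h>

/* Set the soft RSS limit to `bytes`, clamped to the current hard limit.
 * Returns 0 on success, -1 on failure. */
int set_soft_rss_limit(rlim_t bytes)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_RSS, &lim) != 0)
        return -1;
    if (lim.rlim_max != RLIM_INFINITY && bytes > lim.rlim_max)
        bytes = lim.rlim_max;           /* cannot exceed the hard limit */
    lim.rlim_cur = bytes;
    return setrlimit(RLIMIT_RSS, &lim);
}

/* Read back the soft RSS limit currently in force for this process. */
rlim_t get_soft_rss_limit(void)
{
    struct rlimit lim;
    int rc = getrlimit(RLIMIT_RSS, &lim);

    assert(rc == 0);
    (void)rc;                           /* silence warnings under NDEBUG */
    return lim.rlim_cur;
}
```

A shell would typically do the same thing for its children through its ulimit builtin before starting the memory-hungry program.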

The core of the patch comes down to two lines:

    if (mm->rss > mm->rlimit_rss)
        referenced = 0;

This code appears in the function page_referenced_one(), which tries to decide whether a process has actually made use of one of its in-core pages. If the page has not been referenced, it goes directly onto the list of pages to reclaim. All that this particular patch is doing is pretending that a process which has exceeded its maximum resident set size has not actually used any of its pages; as a result, the memory hog's pages will be the first ones to be reclaimed.
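The effect of those two lines can be seen in a toy model. Everything here is hypothetical (struct mm_model, model_page_referenced(), and the hw_referenced flag standing in for whatever the real function learns from the page tables); it only shows how exceeding the limit overrides the "recently used" answer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-process memory descriptor. */
struct mm_model {
    unsigned long rss;          /* pages currently resident */
    unsigned long rlimit_rss;   /* resident set size limit, in pages */
};

/* Returns 1 if the page should be treated as recently used, 0 if not.
 * hw_referenced is what the hardware reference bit would have said. */
int model_page_referenced(const struct mm_model *mm, int hw_referenced)
{
    int referenced = hw_referenced ? 1 : 0;

    assert(mm != NULL);
    if (mm->rss > mm->rlimit_rss)
        referenced = 0;         /* over the limit: pretend it was idle */
    return referenced;
}
```

A process under its limit keeps the benefit of its reference bits; one over its limit always gets 0 back, so its pages look idle to the reclaim code.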

This patch applies on top of the token-based mechanism discussed last week. It modifies that code by depriving a process of the swap token once it goes over its memory limit.

Many systems in the past have chosen to implement hard resident set size limits. On such systems, a process which incurs a page fault will, if it's at its memory limit, immediately surrender one other page back to the memory management system. Rik's patch works differently, in that there are no hard limits. If there is no particular memory pressure, a process can grow to any size. The limit is only applied when the system starts looking for pages to reclaim for other users. This approach is simple, which is always good; it also allows the system to make full use of its memory when there is not a lot of contention.

Out-of-lining spinlocks

Spinlocks, as the core kernel synchronization primitive, are highly performance critical. They are implemented differently on each architecture, by way of some carefully-crafted assembly code, so that not one extra cycle is spent there, especially when the lock is not contended. They are also implemented as inline assembly, so that no function calls get in the way of that fast path.

Recently, however, Zwane Mwaikambo has pulled a patch out of the -tiny tree which moves spinlocks into normal, out-of-line functions - at least, on the x86 and x86-64 architectures. The reason for doing this is to shrink the kernel; there are a lot of spinlock calls in the kernel, and the inline code gets replicated for every one of them. Moving the spinlock code out of line gets rid of that duplication, and shrinks the kernel text size by 50KB or so.

Zwane posted some benchmarks showing that there are no performance regressions. In fact, on some hardware, the improved cache utilization brought about by pulling together the spinlock code can actually improve performance by a slight amount.

The patch comes with a configuration option allowing the spinlock code to be built in either mode. Given that moving the code out of line seems to be a win, some have wondered if things shouldn't always be done that way. Linus pointed out one advantage to the inline code: it makes the sources of lock contention very clear in kernel profiles. With out-of-line spinlocks, all a profile will show is that a lot of time was spent waiting for locks; with the code inline, the function which is actually waiting for the lock shows up instead. So out-of-line locks may be best for production kernels, but developers may want to keep them inline.
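The inline versus out-of-line tradeoff can be illustrated with a toy user-space spinlock. This is not the kernel's implementation, which uses per-architecture assembly; it is a sketch built on C11 atomics, with made-up names, and it relies on the GCC/Clang noinline attribute to force a real function call.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    atomic_flag locked;
} model_spinlock;

#define MODEL_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Inline variant: the loop is duplicated at every call site, so a
 * profile attributes spinning to the function that wanted the lock. */
static inline void spin_lock_inline(model_spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire))
        ;   /* spin until the holder clears the flag */
}

/* Out-of-line variant: one shared copy of the loop, reached by a call.
 * Smaller text, but all spinning is attributed to this one function. */
__attribute__((noinline)) void spin_lock_outofline(model_spinlock *l)
{
    assert(l != NULL);
    spin_lock_inline(l);
}

/* Take the lock only if it is free; returns true on success. */
bool spin_trylock(model_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

void spin_unlock(model_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

Both variants behave identically; the difference lies only in code size and in where the waiting time shows up in a profile.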

Presentations from the cluster summit

The Minneapolis Cluster Summit, held on July 29 and 30, was a gathering of developers interested in pushing forward the state of the art in Linux clustering. The slides from the presentations have now been posted. The topics covered include high availability, OpenSSI, cluster block devices, GFS, lock management, and more.

Patches and updates

Linus Torvalds: Linux 2.6.8-rc4
Andrew Morton: 2.6.8-rc4-mm1
Andrew Morton: 2.6.8-rc3-mm1
Andrew Morton: 2.6.8-rc3-mm2
Nick Piggin: 2.6.8-rc3-np1
Marcelo Tosatti: linux-2.4.27 released
Eric Hustvedt: 2.4.27-lck1
Andi Kleen: x86_64-2.6.8rc3-1 released
James Morris: Re-implemented i586 asm AES
Zwane Mwaikambo: Completely out of line spinlocks / x86_64
Zwane Mwaikambo: Completely out of line spinlocks / i386
Suparna Bhattacharya: Various AIO retry related fixes and enhancements
Suparna Bhattacharya: Collected AIO retry fixes and enhancements
Suparna Bhattacharya: AIO Splice runlist for fairness across io contexts
Suparna Bhattacharya: AIO workqueue context switch reduction
Rick Lindsley: schedstats and staircase scheduler
Con Kolivas: Staircase scheduler for 2.6.8-rc3-mm2
Con Kolivas: Staircase scheduler for 2.6.8-rc4-mm1
Peter Williams: V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation
Peter Williams: V-4.0 Single Priority Array O(1) CPU Scheduler Evaluation
Ingo Molnar: voluntary-preempt-2.6.8-rc3-O4
Ingo Molnar: voluntary-preempt-2.6.8-rc3-O5
Ingo Molnar: preempt-smp.patch, 2.6.8-rc3-mm2
Prasanna S Panchamukhi: kprobes-base-268-rc3.patch
Prasanna S Panchamukhi: kprobes-func-args-268-rc3.patch
Prasanna S Panchamukhi: kprobes-netfilter-268-rc3.patch
Prasanna S Panchamukhi: kprobes-netpktlog-268-rc3.patch
Nigel Cunningham: RFC: Device tree patch (support for partial tree suspend/resume)
Hannes Reinecke: hotplug resource limitation
Patrick Mochel: Fix Device Power Management States
Nathan Bryant: SCSI midlayer power management
Linus Torvalds: Remove ESPIPE logic from drivers, letting the VFS layer handle it instead.
Pete Zaitcev: ub update #7
Linus Torvalds: Add infrastructure for the VFS layer to mark files seekable.
Linus Torvalds: Add pread/pwrite support bits to match the lseek bit.
Linus Torvalds: Add "nonseekable_open()" helper functions for nonseekable
Linux Kernel Mailing List: Make sysctl pass the pos pointer around properly.
Linus Torvalds: read/write: pass down a copy of f_pos, not f_pos itself.
Ingo Molnar: inode-lock-break.patch, 2.6.8-rc3-mm2
Chen, Kenneth W: Hugetlb demanding paging for -mm tree
Rik van Riel: RSS ulimit enforcement for 2.6.8
Jean Tourrilhes: Wireless Extension v17 for Linus
David Howells: implement in-kernel keys & keyring management
David Howells: implement in-kernel keys & keyring management [try #2]
David Howells: implement in-kernel keys & keyring management [try #6]
David Howells: keys & keyring management: key filesystem
Kurt Garloff: [LSM] Rework LSM hooks
Michael Halcrow: settime hooks (1/1)
Pavel Machek: Allow userspace do something special on overtemp
Neil Brown: ANNOUNCE: mdadm 1.7.0 - A tool for managing Soft RAID under Linux
