Kernel development
Brief items
Kernel release status
The current 2.6 prepatch is 2.6.12-rc6, released by Linus on June 6. This one should, if all goes well, be the final testing release before 2.6.12 comes out. Most of the patches are basic fixes, but there is also the (hopefully temporary) removal of the Philips webcam decompression code, the conversion of the IDE code over to the device model way of doing things, a CPU frequency controller update, and a user-mode Linux update. See the long-format changelog for the details.

Linus's git repository has since accumulated a few dozen small fixes.
The current -mm tree is 2.6.12-rc6-mm1. Recent additions to -mm include semi-persistent permissions for sysfs files, the "scalable TCP" congestion control algorithm, hotplug CPU support for the x86_64 architecture, RapidIO support (see below), an NFS update, an unlocked_ioctl() operation for block devices, and the v9fs filesystem (covered here last month).
Kernel development news
Quotes of the week
I am likely to always take the position that device firmware belongs in the kernel proper, not via these userland and filesystem loading mechanisms, none of which may be even _available_ when we first need to get the device going.
A summary of the realtime Linux discussion
Paul McKenney has taken some time and written up a detailed summary of the current status of Linux realtime support. The resulting document starts with a discussion of the problem, then works through the various approaches being taken to provide realtime response with Linux. Worth a read if you have any interest in this area.

The dynamic tick patch
The timer interrupt is one of the most predictable events on a Linux system. Like a heartbeat, it pokes the kernel every so often (about every 1ms on most systems), enabling the kernel to note the passage of time, run internal timers, etc. Most of the time, the timer interrupt handler just does its job and nobody really notices.

There are times, however, when this interrupt can be unwelcome. Many processors, when idle, can go into a low-power state until some work comes along. To such processors, the timer interrupt looks like work. If there is nothing which actually needs to be done, however, then the processor might be powering up 1000 times per second for no real purpose. Timer interrupts can also be an issue on virtualized systems; if a system is hosting dozens of Linux instances simultaneously, the combined load from each instance's timer interrupt can add up to a substantial amount of work. So it has often been thought that there would be a benefit to turning off the timer interrupt when there is nothing for the system to do.
Tony Lindgren's dynamic tick patch is another attempt to put a lid on the timer interrupt. This version of the patch only works on the i386 architecture, but it is simple enough that porting it to other platforms should not be particularly difficult.
The core of the patch is a hook into the architecture-specific cpu_idle() function. If a processor has run out of work and is about to go idle, it first makes a call to dyn_tick_reprogram_timer(). That function checks to see whether all other processors on the system are idle; if at least one processor remains busy, the timer interrupt continues as always. Experience has shown that trying to play games with the timer interrupt while the system is loaded leads to a net loss in performance - the overhead of reprogramming the clock outweighs the savings. So, if the system is working, no changes are made to the timer.
If, instead, all CPUs on the system are idle, there may be an opportunity to shut down the timer interrupt for a while. When the system goes idle, there are only two events which can create new work to do: the completion of an I/O operation or the expiration of an internal kernel timer. The dynamic tick code looks at when the next internal timer is set to go off, and figures it might be able to get away with turning off the hardware timer interrupt until then. After applying some tests (there are minimum and maximum allowable numbers of interrupts to skip), the code reprograms the hardware clock to interrupt after this time period, and puts the processor to sleep.
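In rough form, the idle-path logic described above looks something like the sketch below. The dyn_tick_reprogram_timer() name comes from the patch and next_timer_interrupt() is the stock 2.6 timer query, but the other helpers and the skip limits shown here are hypothetical stand-ins rather than the patch's actual interfaces:

    /*
     * Illustrative sketch only; all_cpus_idle(), the DYN_TICK_*
     * limits, and reprogram_hardware_timer() are hypothetical.
     */
    static void dyn_tick_reprogram_timer(void)
    {
        unsigned long next_event, skip;

        /* Leave the tick alone unless every CPU is idle. */
        if (!all_cpus_idle())
            return;

        /* When is the next internal kernel timer due (in jiffies)? */
        next_event = next_timer_interrupt();
        skip = next_event - jiffies;

        /* Honor the minimum and maximum number of skippable ticks. */
        if (skip < DYN_TICK_MIN_SKIP)
            return;
        if (skip > DYN_TICK_MAX_SKIP)
            skip = DYN_TICK_MAX_SKIP;

        /* Program the hardware clock to interrupt 'skip' ticks from now. */
        reprogram_hardware_timer(skip);
    }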
At some point in the future, an interrupt will come along and wake the processor. It might be the clock interrupt which had been requested before, or it could be some other device - a keyboard or network interface, for example. The dynamic tick code hooks into the main interrupt handler, causing its own handler to be invoked for every interrupt on the system, regardless of source. This code will figure out how many clock interrupts were actually skipped, then loop calling do_timer_interrupt() until it catches up with the current time. Finally, the interrupt handler restores the regular timer interrupt, and the system continues as usual.
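The wakeup side, again in sketch form, might look like the following; the bookkeeping helpers are hypothetical, and the real do_timer_interrupt() on i386 takes interrupt-context arguments which are omitted here:

    /*
     * Illustrative wakeup-path hook, invoked for every interrupt
     * while the tick is stopped; helper names are hypothetical.
     */
    static void dyn_tick_catch_up(void)
    {
        /* How many ticks went by while the CPU was asleep? */
        unsigned long lost = ticks_elapsed_since_sleep();

        /* Replay them so jiffies and kernel timers catch up. */
        while (lost--)
            do_timer_interrupt();

        /* Restore the normal, periodic timer interrupt. */
        reprogram_hardware_timer(1);
    }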
The end result is a system which can drop down to about 6 timer interrupts per second when nothing is going on. That should eventually translate into welcome news for laptop users and virtual hosters running Linux.
RapidIO support for Linux
One of the patch sets which showed up in the 2.6.12-rc6-mm1 kernel is the RapidIO subsystem, contributed by Matt Porter (of MontaVista). Your editor, being ignorant of the RapidIO standard, decided to have a look. RapidIO turns out to be a sort of backplane interconnect intended mainly for embedded systems. It allows multiple hosts to exist on the same bus and work collaboratively with the available peripherals. It is, in effect, a highly local area network.

The RapidIO site provides no end of highly detailed specifications for the truly curious. The rest of us, however, can learn a lot by looking at a network driver packaged with the rest of the Linux RapidIO patch. This driver provides a simple example of how to use the API provided by the RapidIO layer; it enables network packets to be exchanged with another host on the RapidIO bus.
The RapidIO subsystem is integrated with the device model, so it provides the expected structures: rio_dev and rio_driver. Drivers can register a probe() function which enables them to take responsibility for devices (which can be other hosts) as they turn up on the interconnect. The example network driver uses a wildcard ID table so that it is given the opportunity to work with all other devices out there; it will happily send packets to any suitably capable device.
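The registration code in the network driver comes down to something like the following (reconstructed from the driver, so details may differ slightly); note the wildcard entries in the ID table:

    /* Match any vendor and any device on the interconnect. */
    static struct rio_device_id rionet_id_table[] = {
        { RIO_DEVICE(RIO_ANY_ID, RIO_ANY_ID) },
        { 0, }  /* terminate list */
    };

    static struct rio_driver rionet_driver = {
        .name     = "rionet",
        .id_table = rionet_id_table,
        .probe    = rionet_probe,
    };

    /* In the module initialization code: */
    rio_register_driver(&rionet_driver);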
"Suitably capable," in this case, means that the device implements the two basic primitives used to communicate across the RapidIO interconnect. "Doorbells" are a way of sending simple, out-of-band signals to remote nodes; the doorbells used by the network driver are those which announce device addition and removal events. Most work, however, is done with "mailboxes," essentially a reliable packet delivery service. If one RapidIO device sends a message to another via a mailbox, the lower levels will do their best to ensure that the message arrives uncorrupted and in the right order.
So how does one RapidIO network node send a packet to another? Taking out the usual overhead and error handling, it comes down to the following:
    static int rionet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
    {
        struct rionet_private *rnet = ndev->priv;

        rio_add_outb_message(rnet->mport, rdev, 0, skb->data, skb->len);
    }
rdev is a rio_dev structure corresponding to the destination host on the RapidIO backplane. This call sends the data in the network packet (skb) out through the given mailbox to the desired device. When the transmission is complete, the driver will receive a callback so that it can perform any necessary cleanup (freeing the skb in this case).
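In much-simplified form, that completion callback need only free the skb for the slot whose transmission has finished; the real driver also walks its transmit ring from the last acknowledged slot and restarts the queue when space opens up:

    /* Simplified sketch of the transmit-completion callback. */
    static void rionet_outb_msg_event(struct rio_mport *mport, void *dev_id,
                                      int mbox, int slot)
    {
        struct net_device *ndev = dev_id;
        struct rionet_private *rnet = ndev->priv;

        /* The message in 'slot' has gone out; release its skb. */
        dev_kfree_skb_irq(rnet->tx_skb[slot]);
        rnet->tx_skb[slot] = NULL;
    }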
Packet reception requires setting up a ring of receive buffers, much like one would see in any network driver. In this case, the necessary code looks like:
    do {
        rnet->rx_skb[i] = dev_alloc_skb(RIO_MAX_MSG_SIZE);
        if (!rnet->rx_skb[i])
            break;
        rio_add_inb_buffer(rnet->mport, RIONET_MAILBOX,
                           rnet->rx_skb[i]->data);
    } while ((i = (i + 1) % RIONET_RX_RING_SIZE) != end);
The RapidIO subsystem maintains a list of buffers waiting for incoming mailbox messages; new buffers are added with rio_add_inb_buffer(). When a message actually shows up, the driver gets a callback (established when the mailbox is allocated), which, in the end, does the following:
    if (!(data = rio_get_inb_message(rnet->mport, RIONET_MAILBOX)))
        break;
    rnet->rx_skb[i]->data = data;
    skb_put(rnet->rx_skb[i], RIO_MAX_MSG_SIZE);
    error = netif_rx(rnet->rx_skb[i]);
The code assumes that anything arriving on the given mailbox will be a network packet. Beyond that, little checking is required; all of the details, including data integrity checks, will have been taken care of by the lower levels.
The list of RapidIO-capable devices is small at the moment, but appears to be growing. As these devices become available, Linux will have the low-level infrastructure needed to support them. The embedded Linux community has often been accused of keeping its work to itself and not contributing back to the kernel as a whole. The contribution of the RapidIO subsystem is another sign that this situation may be changing; that, perhaps, is more welcome than the code itself.
Automated kernel testing
If there is one thing that almost all kernel developers agree on, it's that more testing is a good thing - especially if the results are presented in a useful way. Martin Bligh thus got a warm reception when he announced a new kernel testing facility, which automatically runs a battery of tests against mainline, -mm, and other kernel trees as releases come out.

That is a fairly wide range of coverage. The results are presented as a simple table, showing which kernels passed the tests and which did not. When a kernel fails a test, the relevant information is provided (though, often, that information is simply "did not boot," which is not entirely helpful).
These results have been augmented with benchmark results, presented in a handy graphical form. The kernbench graph, for example, shows that performance improved significantly around 2.6.6, and has held steady since 2.6.10. The -mm trees, however, perform notably worse than the mainline, and the difference between the two has been growing. Those numbers have already led to some investigation into what is going on; the current suspect is the set of (36!) scheduler patches currently living in -mm.
Numerous others have worked at testing and benchmarking kernel releases. Martin's work, however, has the advantages of being automated and presenting the results in a reasonable way. With these attributes, this project stands a good chance of helping the developers to produce better kernels in the near future.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Janitorial
Memory management
Networking
Miscellaneous
Page editor: Jonathan Corbet