
Kernel development

Brief items

Kernel release status

The current 2.6 development kernel is 2.6.30-rc2, released on April 14. "New 'microblaze' architecture, a somewhat late 'input' layer merge, a new intel virtual networking driver and some firmware loading updates. And mn10300 and frv moved their header files from include/asm to arch. That accounts for the bulk, but shouldn't affect anybody." The short-form changelog is in the announcement; see the full changelog for all the details.

There have been no stable 2.6 updates released in the last week, and none are in the review process.

Comments (none posted)

Kernel development news

Quotes of the week

When the revolution comes, and the people who haven't converted to git get sent to the gulags, we'll make "-M" the default.
-- Linus Torvalds

I have been asked to include aufs into mainline from several people several times. As long as you have strong NACK for aufs and reject all union-type filesystems, I have to give up unwillingly and will answer them "Aufs was rejected. Let's give it up."
-- J.R. Okajima gives up

    while(my_rootfs_hasnt_appeared_and_i_am_sad()) {
	wait_on(&new_disk_discovery);
    }
-- Alan Cox extends the boot API

IBM has a well-known disdain for vowels, and basically refuses to use them for mnemonics (they were called on this, and did "eieio" as an instruction just to try to make up for it).

But I'm from Finland. In Finnish, about 75% of all letters are vowels. I find this dis-emvoweling to be stupid and impractical. Without vowels, you can't tell Finnish words apart (admittedly, _with_ vowels, you generally cannot pronounce them, so to a non-Finn it doesn't much matter).

-- Linus Torvalds (thanks to Ben Hutchings)

Comments (13 posted)

Notes from the LSF storage track

Our recent coverage from the 2009 Linux Storage and Filesystem Workshop (day 1, day 2) contained no notes from the storage track - an unfortunate result of your editor's inability to be in two places at the same time. Happily, James Bottomley took good notes, which he has now made available to us to publish. Topics covered include multipathing, I/O scheduling and tracing, ATA issues, and more; click below for the full text.

Full Story (comments: 3)

Rebasing and merging: some git best practices

By Jonathan Corbet
April 14, 2009
In a typical development cycle, Linus Torvalds pulls patches from over 100 git trees into the mainline repository. While this is going on, it's not unusual for him to complain about how some of those trees are managed; most of the gripes have to do with excessive use of rebasing and merging operations. In a recent discussion on the dri-devel list, Linus somewhat clarified his rules for subsystem tree management. Your editor, on the theory that there might be a developer or two out there who does not read dri-devel, thought that it might be good to expose those rules more widely.

The git "rebase" operation takes a set of patches applied to one tree and reworks them to apply to a different tree. If a developer has written some patches against 2.6.29, he or she can use "git rebase" to turn them into patches against 2.6.30-rc1 instead. With git, rebasing can also be used to make edits to the commit history. If something needs to be fixed in a patch which was made some time ago, the developer can (1) remove the more recent patches from the tree, (2) make the needed changes, and (3) rebase the removed patches back onto the fixed patch. This technique can be used to silently disappear an embarrassing bug from the history, improve patch changelogs, fix a patch conflict against somebody else's tree, and more. It's something that git-based developers simply end up doing occasionally.

There are a couple of problems associated with rebasing, though. One of those is that it changes the commit history. Whenever a series of commits is rebased, anybody who was working with the old history is left out in the cold. If a heavily-used tree is rebased, all developers depending on that tree are forced to scramble to readjust to the new reality. The other problem is that rebased patches are changed patches; any testing that they saw may no longer be applicable. That is why Linus tends to grumble hard at trees which have obviously been rebased just prior to the sending of a pull request. The changes in those trees probably worked before the rebase, but the post-rebase changes have not been tested and may not work as well.

Rebasing is clearly a useful technique, though. Linus does not tell developers not to use it; in fact, he encourages it sometimes. The key rule that was passed down is this: Thou Shalt Not Rebase Trees With History Visible To Others. If a developer has pulled in somebody else's tree, the resulting tree can no longer be rebased, since that would break the second developer's history. Similarly, once a tree has been exported such that others may be making use of it, it can no longer be rebased.

On the other hand, private history can be rebased at will - and it probably should be. If a patch is seen to introduce a bug, it's best to fix it at the source rather than reverting it or adding a second, fixup patch; the result is a cleaner history which is less likely to create problems for people trying to bisect unrelated bugs. Your editor has found that rebasing is often needed to add tags ("Acked-by," for example) to patches which have been circulated for review. When one is creating a set of patches for the mainline kernel, one is really creating an entire history, not just the end result. Making that history clean and readable is to everybody's benefit.

The associated rule that goes with this, though, is that trees which are subject to rebasing should not be exposed to the world:

This means: if you're still in the "git rebase" phase, you don't push it out. If it's not ready, you send patches around, or use private git trees (just as a "patch series replacement") that you don't tell the public at large about.

So, in other words, trees which might be rebased should be kept private. They should also not have other developers' trees pulled into them.

It's worth noting that Linus very much practices what he preaches on this front. The mainline git repository accepts 10,000 or so changesets every development cycle, but it is never rebased. And that is a good thing: rebasing the mainline would cause massive pain throughout the development community.

Merging is the other place where subsystem maintainers can run afoul of the Chief Penguin. A "merge" in git is similar to a merge in most other source code management systems; it joins two (or more) independent lines of development into the current branch. Git merges differ, though, in that they can have more than two incoming branches; Ingo Molnar is famous for his use of "octopus merges" joining vast numbers of branches in a single operation. In almost all cases, performing a merge adds a special commit to the repository indicating that the merge has been done and noting which files, if any, had conflicts.
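
Nothing special is needed to create an octopus merge; naming more than one branch in a single "git merge" invocation suffices. A quick sketch, with made-up branch names:

    git checkout master
    git merge topic-a topic-b topic-c    # one merge commit with four parents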

Merges go both ways. When Linus pulls a subsystem tree into the mainline, the result is a merge. But it is also common for developers to perform merges in the other direction; they will pull the mainline (or some higher-level subsystem tree) into a branch containing a local line of development. It is natural to want to develop code against the current state of the art; it gives confidence that the end result will work with everybody else's changes and minimizes the chances of an ugly merge conflict later on.

But excessive pulling from the mainline (as evidenced by the merge commits which result) tends to irritate Linus. As he put it:

But if I see a lot of "Merge branch 'linus'" in your logs, I'm not going to pull from you, because your tree has obviously had random crap in it that shouldn't be there. You also lose a lot of testability, since now all your tests are going to be about all my random code.

As anybody who has worked with tip-of-the-repository kernels knows, the state of the mainline at any random point can be, well, random. So frequent pulling of the mainline into a development branch will add a certain amount of randomness to that branch; this randomness is not particularly helpful for somebody who is trying to get a feature working. It also increases the chances that another developer who ends up in the middle of the series while running a bisect operation will encounter unrelated bugs. So Linus would rather that developers not pull down from upstream trees:

And, in fact, preferably you don't pull my tree at ALL, since nothing in my tree should be relevant to the development work _you_ do. Sometimes you have to (in order to solve some particularly nasty dependency issue), but it should be a very rare and special thing, and you should think very hard about it.

The reality of the situation tends not to be so strict, though. An occasional merge to stay on top of what's happening elsewhere can make sense. What Linus suggests, though, is that the merges happen at specific release points. So pulling the tip of the mainline tree into a development tree probably does not make sense, but there might be an argument for pulling in 2.6.29 or 2.6.30-rc1. Doing things this way allows development to be based on a (hopefully) relatively stable point where the issues are known.
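
In git terms, the difference comes down to what gets named on the command line; the "linus" remote below is hypothetical:

    # Merging at a known release point - generally defensible:
    git merge v2.6.29

    # Merging whatever happens to be at the tip - likely to draw complaints:
    git merge linus/master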

The temptation to merge in the mainline during development can be hard to resist; one likes to know whether one's work is even remotely relevant to the current state of the code. Fortunately, git makes it really easy to create throwaway branches and test out merges and integration there. Once it's clear that things work, the test branch can be deleted and the (unmerged) development branch sent upstream.
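
A minimal version of that technique, again with hypothetical branch and remote names, might be:

    # Try the merge on a disposable branch.
    git checkout -b test-merge my-feature
    git merge linus/master
    # ... build and test the result ...

    # All good?  Throw the merge away and submit the clean branch.
    git checkout my-feature
    git branch -D test-merge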

Similar rules apply to the merging of downstream code. The receiving repository should be in a reasonably well defined and stable state; typically developers maintain a "for upstream" branch for this kind of merge. And the downstream code should be "ready": it should be at some sort of release point and not in a random state of development.
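
On the receiving side, that might look like the following sketch; the branch name, URL, and tag are all invented for the example:

    # Merge a contributor's work into the "for upstream" branch,
    # but only at a tagged release point, never a random snapshot.
    git checkout for-linus
    git pull git://git.example.org/contributor.git feature-v2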

Of course, these rules are not absolute:

Git does allow people to do many different things, and solve problems different ways. I just want all the regular workflows to be "good practice", but then if you have to occasionally break the rules to solve some odd problem, go ahead and break the rules (and tell people why you did it that way this time).

Linus first started playing with BitKeeper in February, 2002, so the kernel community now has seven years' worth of experience with distributed version control. But the truth of the matter is that we are still figuring out the best way to use this particular tool. This is a process which is likely to continue for some time yet. As other large projects move toward using tools like git, they may want to look hard at the processes and rules which have been developed in the kernel community; they might just be able to shorten their own learning experience.

Comments (1 posted)

Solving the ext3 latency problem

By Jonathan Corbet
April 14, 2009
One might think that the ext3 filesystem, by virtue of being standard on almost all installed Linux systems for some years now, would be reasonably well tuned for performance. Recent events have shown, though, that some performance problems remain in ext3, especially in places where the fsync() system call is used. It's impressive what can happen when attention is drawn to a problem; the 2.6.30 kernel will contain fixes which seemingly eliminate many of the latencies experienced by ext3 users. This article will look at the changes that were made, including a surprising change to the default journaling mode made just before the 2.6.30-rc1 release.

The problem, in short, is this: the ext3 filesystem, when running in the default data=ordered mode, can exhibit lengthy stalls when some process calls fsync() to flush data to disk. This issue most famously manifested itself as the much-lamented Firefox system-freeze problem, but it goes beyond just Firefox. Anytime there is reasonably heavy I/O going on, an fsync() call can bring everything to a halt for several seconds. Some stalls on the order of minutes have been reported. This behavior has tended to discourage the use of fsync() in applications and it makes the Linux desktop less fun to use. It's clearly worth fixing - but nobody did that for years.

When Ted Ts'o looked into this behavior, he noticed an obvious problem: data sent to the disk via fsync() is put at the back of the I/O scheduler's queue, behind all other outstanding requests. If processes on the system are writing a lot of data, that queue could be quite long. So it takes a long time for fsync() to complete. While that is happening, other parts of the filesystem can stall, eventually bringing much of the system to a halt.

The first fix was to tag I/O requests generated by fsync() with the WRITE_SYNC operation bit, marking them as synchronous requests. The CFQ I/O scheduler tries to run synchronous requests (which generally have a process waiting for the results) ahead of asynchronous ones (where nobody is waiting). Normally, reads are considered to be synchronous, while writes are not. Once the fsync()-related requests were made synchronous, they were able to jump ahead of normal I/O. That makes fsync() much faster, at the expense of slowing down the I/O-intensive tasks in the system. This is considered to be a good tradeoff by just about everybody involved. (It's amusing to note that this change is conceptually similar to the I/O priority patch posted by Arjan van de Ven some time ago; some ideas take a while to reach acceptance.)

Block subsystem maintainer Jens Axboe disliked the change, though, stating that it would cause severe performance regressions for some workloads. Linus made it clear that the patch was probably going to go in and that, if the CFQ I/O scheduler couldn't handle it, there would soon be a change to a different default scheduler. Jens probably would have looked further in any case, but the extra motivation supplied by Linus is unlikely to have slowed this process down.

The problem, as it turns out, is that WRITE_SYNC actually does two things: putting the request onto the higher-priority synchronous queue, and unplugging the queue. "Plugging" is the technique used by the block layer to issue requests to the underlying disk driver in bursts. Between bursts, the queue is "plugged," causing requests to accumulate there. This accumulation gives the I/O scheduler an opportunity to merge adjacent requests and issue them in some sort of reasonable order. Judicious use of plugging improves block subsystem performance significantly.

Unplugging the queue for a synchronous request can make sense in some situations; if somebody is waiting for the operation, chances are they will not be adding any adjacent requests to the queue, so there is no point in waiting any longer. As it happens, though, fsync() is not one of those situations. Instead, a call to fsync() will usually generate a whole series of synchronous requests, and the chances of those requests being adjacent to each other are fairly good. So unplugging the queue after each synchronous request is likely to make performance worse. Upon identifying this problem, Jens posted a series of patches to fix it. One of them adds a new WRITE_SYNC_PLUG operation which queues a synchronous write without unplugging the queue. This allows operations like fsync() to create a series of operations, then unplug the queue once at the end.

While he was at it, Jens fixed a couple of related issues. One was that the block subsystem could still, in some situations, run synchronous requests behind asynchronous requests. The code here is a bit tricky, since it may be desirable to let a few asynchronous requests through occasionally to prevent them from being starved entirely. Jens changed the balance to ensure that synchronous requests get through in a timely manner.

Beyond that, the CFQ scheduler uses "anticipatory" logic with synchronous requests; upon executing one such request, it will stall the queue to see if an adjacent request shows up. The idea is that the disk head will be ideally positioned to satisfy that request, so the best performance is obtained by not moving it away immediately. This logic can work well for synchronous reads, but it's not helpful when dealing with write operations generated by fsync(). So now there's a new internal flag that prevents anticipation when WRITE_SYNC_PLUG operations are executed.

Linus liked the changes:

Goodie. Let's just do this. After all, right now we would otherwise have to revert the other changes as being a regression, and I absolutely _love_ the fact that we're actually finally getting somewhere on this fsync latency issue that has been with us for so long.

It turns out that there's a little more, though. Linus noticed that he was still getting stalls, even if they were much shorter than before, and he wondered why:

One thing that I find intriguing is how the fsync time seems so _consistent_ across a wild variety of drives. It's interesting how you see delays that are roughly the same order of magnitude, even though you are using an old SATA drive, and I'm using the Intel SSD.

The obvious conclusion is that there was still something else going on. Linus's hypothesis was that the volume of requests pending to the drive was large enough to cause stalls even when the synchronous requests go to the front of the queue. With a default configuration, requests can contain up to 512KB of data; stack up a couple dozen or so of those, and it's going to take the drive a little while to work through them. Linus experimented with setting the maximum size (controlled by /sys/block/drive/queue/max_sectors_kb) to 64KB, and reports that things worked a lot better. As of this writing, though, the default has not been changed; Linus suggested that it might make sense to cap the maximum amount of outstanding data, rather than the size of any individual request. More experimentation is called for.
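
Repeating that experiment requires no kernel patches, since the request size cap is a runtime tunable; the drive name below is, naturally, system-dependent:

    # Check the current per-request cap (often 512KB):
    cat /sys/block/sda/queue/max_sectors_kb

    # Limit requests to 64KB, as in Linus's experiment:
    echo 64 > /sys/block/sda/queue/max_sectors_kb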

There is one other important change needed to get a truly quick fsync() with ext3, though: the filesystem must be mounted in data=writeback mode. This mode eliminates the requirement that data blocks be flushed to disk ahead of metadata; in data=ordered mode, the need to write all of that pending data first guarantees that fsync() will always be slower. Switching to data=writeback eliminates those forced data writes, but, in the process, it also turns off the feature which made ext3 seem more robust than ext4. Ted Ts'o has mitigated that problem somewhat, though, by adding the same safeguards he put into ext4. In some situations (such as when a new file is renamed on top of an existing file), data will be forced out ahead of metadata. As a result, data loss resulting from a system crash should be less of a problem.

Sidebar: data=guarded

Another alternative to data=ordered may be the data=guarded mode proposed by Chris Mason. This mode would delay size updates to prevent information disclosure problems. It is a very new patch, though, which won't be ready for 2.6.30.

The other potential problem with data=writeback is that, in some situations, a crash can leave a file with blocks allocated to it which have not yet been written. Those blocks may contain somebody else's old data, which is a potential security problem. Security is a smaller issue than it once was, for the simple reason that multiuser Linux systems are relatively scarce in 2009. In a world where most systems are dedicated to a single user, the potential for information disclosure in the event of a crash seems vanishingly small. In other words, it's not clear that the extra security provided by data=ordered is worth the associated performance costs anymore.

So Ted suggested that, maybe, data=writeback should be made the default. There was some resistance to this idea; not everybody thinks that ext3, at this stage of its life, should see a big option change like that. Linus, however, was unswayed by the arguments. He merged a patch which creates a configuration option for the default ext3 data mode, and set it to "writeback." That will cause ext3 mounts to silently switch to data=writeback mode with 2.6.30 kernels. Says Linus:

I'm expecting that within a few months, most modern distributions will have (almost by mistake) gotten a new set of saner defaults, and anybody who keeps their machine up-to-date will see a smoother experience without ever even realizing _why_.

It's worth noting that this default will not change anything if (1) the data mode is explicitly specified when the filesystem is mounted, or (2) a different mode has been wired into the filesystem with tune2fs. It will also be ineffective if distributors change it back to "ordered" when configuring their kernels. Some distributors, at least, may well decide that they do not wish to push that kind of change to their users. We will not see the answer to that question for some months yet.
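
For the impatient, selecting the mode ahead of 2.6.30 is straightforward; the device and mount point here are hypothetical:

    # Request writeback mode explicitly at mount time:
    mount -o data=writeback /dev/sda2 /home

    # Or wire it into the filesystem itself, as mentioned above:
    tune2fs -o journal_data_writeback /dev/sda2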

Comments (53 posted)

Hotplug file descriptors and revoke()

By Jonathan Corbet
April 14, 2009
Once upon a time, operating systems did not have to worry about hardware coming and going at awkward times. Whatever peripherals were bolted into the box when the system booted could be counted on to still be there at shutdown time. Contemporary systems don't work that way; devices will come and go at the whim of the user. Various subsystems have evolved mechanisms for coping with hardware which suddenly vanishes, but that code tends to be subsystem-specific and complex. Eric Biederman recently encountered this code and didn't really like what he saw. So he has set out to make something better.

Eric's patch series begins with this observation:

Not long after I touched the tun driver and made it safe to delete the network device while still holding it's file descriptor open I [saw] someone else touch the code adding a different feature and my careful work went up in flames. Which brought home another point: at the best of it this is ultimately complex tricky code that subsystems should not need to worry about.

Eric also notes that the growth in hotplug-capable PCI devices will increase the number of subsystems and drivers which need to be prepared for this eventuality. Rather than spread hotplug-specific code through more parts of the kernel, he would like to create one central, well-supported mechanism.

The issue that Eric is looking at in particular is: what happens to open file descriptors when the underlying resource goes away? Regardless of whether that resource is a physical device, a module, or something different altogether, the kernel needs to do the right thing when the file descriptor no longer points to something valid. Eric's patches create a new infrastructure which allows any subsystem to easily revoke access to a file descriptor in a more reliable and robust manner than has been seen before.

The first issue that comes up is, invariably, mmap(). If a no-longer-existing device or file has been mapped into a process's address space, interesting and unpleasant things could happen. Eric's answer is a new function:

    void remap_file_mappings(struct file *file,
                             struct vm_operations_struct *vm_ops);

A call to remap_file_mappings() will locate every virtual memory area (VMA) associated with the given file. All mapped pages will be unmapped, making them inaccessible to the process which had mapped them. The operations associated with the VMA will be replaced with vm_ops; those operations will normally be revoked_vm_ops, which simply return a bus error whenever the process attempts to access one of the affected pages.

The kernel also clearly needs to block any other operations - read(), write(), ioctl(), etc. - which might be performed on this file descriptor. The way to do that, of course, is to replace the file_operations structure associated with the file. The function to do that is:

    int fops_substitute(struct file *file, const struct file_operations *f_op,
                        struct vm_operations_struct *vm_ops);

One might imagine that this function could be quite simple, along the lines of:

    file->f_op = f_op;
    remap_file_mappings(file, vm_ops);

But the truth of the matter is rather more complicated. To begin with, there may be threads running in the old file operations, and some of those might be waiting for events which will, now, never happen. As a way of helping drivers unwedge themselves in this situation, Eric's patches add a new entry to struct file_operations:

    int (*awaken_all_waiters)(struct file *filp);

This function should cause any thread which is waiting for the given file to wake up and take note that the world has changed.

The next sticking point is that, now that the file operations have been swapped out, there is no way for the underlying driver to know when all file descriptors have been closed. That is handled by waiting until there are no more known users of the old file operations, then calling the release() function directly from fops_substitute(). That leads to the sticky question of what happens if some thread never wakes up and the usage count never goes to zero; in the current patch, fops_substitute() will simply hang in this situation.

Before one can even worry about that, though, there is the troublesome point that the kernel has no idea how many users of a given file_operations structure exist. So Eric has had to add a reference counting mechanism. In the new way of doing things, any kernel code must bracket calls into a file's file_operations with:

    int fops_read_lock(struct file *file);
    void fops_read_unlock(struct file *file, int revoked);

The return value from fops_read_lock() (which Eric invariably calls fops_idx) is non-zero if access to the file has already been revoked; it must be passed into the matching call to fops_read_unlock(). The biggest part of the patch series is a slog through the core VFS code adding locking around every file_operations access. That's a lot of little code changes which have to be made in a lot of places.

There is a payoff, though: the handling of revoked files in various other subsystems can be ripped out and replaced with the new, generic code. The changes to the /proc filesystem, for example, leave the code almost 400 lines shorter. So the kernel gets smaller, and the new code should, with luck, be more robust and more maintainable.

This mechanism is useful for situations where devices disappear, but there is also a bigger goal in sight. There has long been a desire for a generic revoke() system call which would disconnect all open descriptors to a given file or device. It could be used to implement some sort of secure attention key, killing all processes which have open file descriptors to a console device, for example. revoke() would also be useful for forced unmounting of filesystems. It's a useful idea, with only one problem: revoke() is really hard. Nobody has yet come through with an implementation that looks complete and robust enough to be put into the kernel.

Eric's patch set has not gotten there yet either. But it does represent another stab at the problem using an approach which, most developers agree, is the way that revoke() needs to be implemented. Over time, it might just evolve into the general solution which has evaded other developers for years.

Comments (2 posted)

Patches and updates

Kernel trees

Linus Torvalds: Linux 2.6.30-rc2
Thomas Gleixner: 2.6.29.1-rt5
Thomas Gleixner: 2.6.29.1-rt6
Thomas Gleixner: 2.6.29.1-rt7

Architecture-specific

Core kernel code

Development tools

Device drivers

Documentation

Filesystems and block I/O

Memory management

Networking

Security-related

Virtualization and containers

Gregory Haskins: virtual-bus

Benchmarks and bugs

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds