Kernel development

Brief items

Kernel release status

The current 2.6 development kernel is 2.6.30-rc3, released on April 21. "The diffstat really shows lots of small one-liners and two-liners, although there are areas that are getting bigger patches (ignoring the bulky but uninteresting arm defconfig updates): some x86 updates, some block IO scheduling fixes, splice cleanups and fixes, and a number of driver changes (sound, networking, staging, usb)." The short-form changelog is in the announcement, or see the full changelog for all the details.

The current stable 2.6 release remains 2.6.29.1; there have been no stable 2.6 updates since April 2.

For the fans of extreme stability, though, 2.4.37.1 was released on April 19. "Most of these fixes concern minor security issues which have been backported from 2.6 (mostly local DoSes). In my opinion, only people with local users might consider upgrading, if those people still exist!"


Kernel development news

Quotes of the week

The number of contributors who can write meaningful changelogs or who can be taught to write really good changelogs is very, very low. I'd guesstimate somewhere around 5% of all Linux contributors. (The guesstimation is probably on the more generous side.)

-- Ingo Molnar

No subject should ever contain the word "trivial". If it's really trivial, you can sum it up in the subject and we'll know it's trivial. Plus the diffstat shows it. 'trivial' is propaganda to sneak a patch into -rc7.

-- Rusty Russell

In the past 15 years of Linux we've invested a lot of time and effort into working around and dealing with compiler crap. We wasted a lot of opportunities waiting years for sane compiler features to show up. We might as well have invested that effort into building our own compiler and could stop bothering about externalities.

-- Ingo Molnar


In search of the perfect changelog

By Jonathan Corbet
April 22, 2009

When kernel developers engage in an extended discussion on the writing of changelogs for patches, one might well conclude that they have run out of useful things to do. But arguments over changelogs are not the same as spelling or grammar flames. In an environment where 10,000 or so changes are merged in every three-month development cycle, developers need all the help they can get to understand what is going into the kernel. Poorly-described patches are harder to understand, and harder to find when searching the history for something specific. So getting changelogs right helps the development process - and the kernel - as a whole.

It all started innocently enough; Linus was engaging in a routine patch flaming when he encountered one of the "Impact:" tags that some developers (especially those working with Ingo Molnar's trees) have adopted in recent months:

    Impact: clarify and extend confusing API

Suffice it to say that he was not much impressed with it:

And what the hell is up with these bogus "Impact:" things? Who started doing that, and why? If your single-line explanation at the top is not good enough, and your multi-line explanation isn't clear enough, then you should fix the OTHER parts, not add that _idiotic_ "Impact" statement.

From there, the extended conversation focused on two related topics: the value of "impact" tags and how to write better changelogs in general. On the former, the primary (but not only) proponent of these tags is Ingo Molnar, who cites several benefits from their use. Using these tags, he claims, forces developers to write smaller patches which can be adequately described in a single line. They give subsystem maintainers an easy way to assess the changes made by a set of patches and their associated risk; they also make it easier to review a patch against its declared "impact." These tags are also said to force a certain clarity of thought, making developers think through the consequences of a change.

Most of these arguments leave "Impact:" detractors unmoved, though. Rather than add yet another tag to a patch, they would prefer to see developers just write better changelogs from the outset. In a properly-documented patch, the new tag is just irrelevant. Andrew Morton said:

I'm getting quite a few Impact:s now and I must say that the Impact: line is always duplicative of the Subject:. Except in a few cases, and that's because the Subject: sucked.

Ingo disputed that claim at length, needless to say. But he takes things further by stating that, while better changelogs would certainly be desirable, they are not a practical goal. According to Ingo, most developers are simply not capable of writing good changelogs. Language barriers and such often are part of this problem, but it goes deeper: most developers simply lack the writing skills needed to write clear and concise changelogs. This fact of life, as Ingo sees it, cannot really be changed, but most developers can, at least, be trained to write a reasonable impact tag.

It is probably fair to say that most developers do not see themselves as being disabled in this way. That said, it is also fair to say that a lot of patches go into the mainline with unhelpful changelogs. That can probably be changed - to an extent at least - through pressure from maintainers and a better understanding of what makes a good changelog. In an attempt to help, your editor has proposed a brief addition to Documentation/development-process:

Writing good changelogs is a crucial but often-neglected art; it's worth spending another moment discussing this issue. When writing a changelog, you should bear in mind that a number of different people will be reading your words. These include subsystem maintainers and reviewers who need to decide whether the patch should be included, distributors and other maintainers trying to decide whether a patch should be backported to other kernels, bug hunters wondering whether the patch is responsible for a problem they are chasing, users who want to know how the kernel has changed, and more. A good changelog conveys the needed information to all of these people in the most direct and concise way possible.

To that end, the summary line should describe the effects of and motivation for the change as well as possible given the one-line constraint. The detailed description can then amplify on those topics and provide any needed additional information. If the patch fixes a bug, cite the commit which introduced the bug if possible. If a problem is associated with specific log or compiler output, include that output to help others searching for a solution to the same problem. If the change is meant to support other changes coming in a later patch, say so. If internal APIs are changed, detail those changes and how other developers should respond. In general, the more you can put yourself into the shoes of everybody who will be reading your changelog, the better that changelog (and the kernel as a whole) will be.

Other possible additions have been proposed by Ted Ts'o and Paul Gortmaker. Of course, all of these patches are based on the optimistic notion that developers will actually read the documentation.

One could argue that the kernel community is rather late in getting around to this kind of discussion. That could be said to be par for the course; in the pre-BitKeeper era (i.e. up to February, 2002), there was almost no tracking of individual changes into the kernel at all. That the fine points of changelogging are being discussed a mere seven years later suggests things are going in the right direction. The level of professionalism in the kernel community has been on the rise for a long time; this process is likely to continue. Whether or not some variant on the impact tag is used in the future, one can assume that the quality of changelogs will, as a whole, be better.


The slow work mechanism

By Jonathan Corbet
April 22, 2009

Many years ago, your editor heard Van Jacobson state that naming an algorithm "slow start" was one of the biggest mistakes he had ever made. The name refers to the technique of ramping up transmit rates slowly until the carrying capacity of the connection is determined. But others just saw "slow" and complained that they didn't want their connections to be slow. The fact that "slow start" made the net faster was lost on them. One might wonder if David Howells's "slow work" mechanism - merged for 2.6.30 - could run into similar problems; no kernel developer wants things to run slowly. But, as with slow start, running things slowly is not the point.

Slow work is a thread pool implementation - yet another thread pool, one might say. The kernel already has workqueues and the asynchronous function call infrastructure; the distributed storage (DST) module added to the -staging tree for 2.6.30 also has a thread pool hidden within it. Each of these pools is aimed at a different set of uses. Workqueues provide per-CPU threads dedicated to specific subsystems, while asynchronous function calls are optimized for specific ordering of tasks. Slow work, instead, looks like a true "batch job" facility which can be used by kernel subsystems to run tasks which are expected to take a fair amount of time in their execution.

A kernel subsystem which wants to run slow work jobs must first declare its intention to the slow work code:

    #include <linux/slow-work.h>

    int slow_work_register_user(void);

The call to slow_work_register_user() ensures that the thread pool is set up and ready for work - no threads are created before the first user is registered. The return value will be either zero (on success) or the usual negative error code.

Actual slow work jobs require the creation of two structures:

    struct slow_work;

    struct slow_work_ops {
        int (*get_ref)(struct slow_work *work);
        void (*put_ref)(struct slow_work *work);
        void (*execute)(struct slow_work *work);
    };

The slow_work structure is created by the caller, but is otherwise opaque. The slow_work_ops structure, created separately, is where the real work gets done. The execute() function will be called by the slow work code to get the actual job done. But first, get_ref() will be called to obtain a reference to the slow_work structure. Once the work is done, put_ref() will be called to return that reference. Slow work items can hang around for some time after they have been submitted, so reference counting is needed to ensure that they are freed at the right time. Implementing get_ref() and put_ref() is not optional.

In practice, kernel code using slow work will create its own structure which contains the slow_work structure and some sort of reference-counting primitive. The slow_work structure must be initialized with one of:

    void slow_work_init(struct slow_work *work, const struct slow_work_ops *ops);
    void vslow_work_init(struct slow_work *work, const struct slow_work_ops *ops);

The difference between the two is that vslow_work_init() identifies the job as "very slow work" which can be expected to run (or sleep) for a significant period of time. The documentation suggests that writing to a file might be "slow work," while "very slow work" might be a sequence of file lookup, creation, and mkdir() operations. The slow work code actually prioritizes "very slow work" items over the merely slow ones, but only up to the point where they use 50% (by default) of the available threads. Once the maximum number of very slow jobs is running, only "slow work" tasks will be executed.
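The embedding described above, a caller-owned structure holding the slow_work item alongside a reference count, can be sketched in user-space C. This is only an illustration: the two structures below are stand-ins that mirror the declarations quoted earlier rather than the real <linux/slow-work.h>, the ops field inside struct slow_work is an assumption (the article treats the structure as opaque), and my_job, its callbacks, and run_once() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-ins mirroring the declarations quoted in the text;
 * the real definitions live in <linux/slow-work.h>. */
struct slow_work;

struct slow_work_ops {
    int (*get_ref)(struct slow_work *work);
    void (*put_ref)(struct slow_work *work);
    void (*execute)(struct slow_work *work);
};

struct slow_work {
    const struct slow_work_ops *ops;   /* assumed field, for illustration */
};

/* Hypothetical caller-owned job: the slow_work item plus a plain counter.
 * Kernel code would use an atomic type rather than a bare int. */
struct my_job {
    struct slow_work work;
    int refcount;
    int runs;       /* how many times execute() has been called */
    int released;   /* set when the last reference is dropped */
};

/* Recover the enclosing my_job from a pointer to its embedded slow_work. */
static struct my_job *to_my_job(struct slow_work *work)
{
    return (struct my_job *)((char *)work - offsetof(struct my_job, work));
}

static int my_job_get_ref(struct slow_work *work)
{
    to_my_job(work)->refcount++;
    return 0;   /* zero: the reference was obtained */
}

static void my_job_put_ref(struct slow_work *work)
{
    struct my_job *job = to_my_job(work);

    assert(job->refcount > 0);
    if (--job->refcount == 0)
        job->released = 1;   /* a real user would free the job here */
}

static void my_job_execute(struct slow_work *work)
{
    to_my_job(work)->runs++;
}

static const struct slow_work_ops my_job_ops = {
    .get_ref = my_job_get_ref,
    .put_ref = my_job_put_ref,
    .execute = my_job_execute,
};

/* Stand-in for the life cycle of one item, in the order the text gives:
 * obtain a reference, execute, return the reference. */
static int run_once(struct slow_work *work)
{
    int ret = work->ops->get_ref(work);

    if (ret < 0)
        return ret;
    work->ops->execute(work);
    work->ops->put_ref(work);
    return 0;
}
```

The point of the sketch is that the slow work code only ever sees the embedded struct slow_work; everything else, including the lifetime of the containing object, is the caller's business and is reached through the callbacks.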

Actually getting a slow work task running is done with:

    int slow_work_enqueue(struct slow_work *work);

This function queues the task for running. It will succeed unless the associated get_ref() function fails, in which case -EAGAIN will be returned.

Slow work tasks can be enqueued multiple times, but no count is kept, so a task enqueued several times before it begins to execute will only run once. A task which is enqueued while it is running, however, is put back on the queue for a second execution later on. The same task is guaranteed not to run on multiple CPUs simultaneously.
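These queueing rules can be modeled with a few flags per item. The following user-space sketch is a toy state machine written for this illustration, not the kernel's implementation; item_state and the model_*() functions are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one slow work item's queueing state. */
struct item_state {
    bool pending;    /* sitting on the queue, waiting for a thread */
    bool executing;  /* a pool thread is inside execute() right now */
    bool requeue;    /* enqueued again while executing */
    int runs;        /* completed executions */
};

/* Enqueue: no count is kept, so repeated calls collapse into one flag. */
static void model_enqueue(struct item_state *s)
{
    if (s->executing)
        s->requeue = true;    /* run once more after the current run ends */
    else
        s->pending = true;    /* an already-pending item stays queued once */
}

/* A pool thread tries to pick the item up. Returns false if the item is
 * not queued or is already running, so it never runs twice at once. */
static bool model_start(struct item_state *s)
{
    if (!s->pending || s->executing)
        return false;
    s->pending = false;
    s->executing = true;
    return true;
}

/* The run finishes; a deferred enqueue puts the item back on the queue. */
static void model_finish(struct item_state *s)
{
    assert(s->executing);
    s->executing = false;
    s->runs++;
    if (s->requeue) {
        s->requeue = false;
        s->pending = true;
    }
}
```

In this model, three calls to model_enqueue() before the item starts lead to a single run, while any number of calls made during a run lead to exactly one further run.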

There is no way to remove tasks which have been queued for execution, and there is no way (built into the slow work mechanism) to wait for those tasks to complete. A "wait for completion" functionality can certainly be created by the caller if need be. The general assumption, though, seems to be that slow work items can be outstanding for an indefinite period of time. As long as tasks with a non-zero reference count exist, any resources they depend on need to remain available.

There are three parameters for controlling slow work which appear under /proc/sys/kernel/slow-work: min-threads (the minimum size of the thread pool), max-threads (the maximum size), and vslow-percentage (the maximum percentage of the available threads which can be used for "very slow" tasks). The defaults allow for between two and four threads, 50% of which can run "very slow" tasks.
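As a worked example of the defaults, the limit on "very slow" threads can be read as a simple percentage of the pool size. The helper below is an illustrative reading of vslow-percentage with a hypothetical name; the kernel's own calculation may treat rounding and edge cases differently.

```c
#include <assert.h>

/* Upper bound on threads that may run "very slow" items at once, taken
 * as a plain percentage of the pool size (integer division rounds down).
 * Illustrative only; this is not the kernel's code. */
static unsigned int vslow_limit(unsigned int threads, unsigned int percentage)
{
    assert(percentage <= 100);
    return threads * percentage / 100;
}
```

With the default 50%, a pool at its minimum of two threads would allow one very slow task at a time, and a pool at its maximum of four threads would allow two.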

The only user of slow work in the 2.6.30 kernel is the FS-Cache file caching subsystem. There is a clear need for thread pool functionality, though, so it would not be surprising to see other users show up in future releases. What might be more surprising (though desirable) would be a consolidation of thread pool implementations in a future development cycle.


DRBD: a distributed block device

April 22, 2009

This article was contributed by Goldwyn Rodrigues

The three R's of high availability are Redundancy, Redundancy and Redundancy. However, on a typical setup built with commodity hardware, there is a limit to how much redundancy can be added to increase the number of 9's in the uptime percentage (e.g. 99.999%). Consider a simple example: an iSCSI server with the cluster nodes using a distributed filesystem such as GFS2 or OCFS2. Even with redundant power supplies and data channels on the iSCSI storage server, there still exists a single point of failure: the storage.

The Distributed Replicated Block Device (DRBD) patch, developed by Linbit, introduces duplicated block storage over the network with synchronous data replication. If one of the storage nodes in the replicated environment fails, the system has another block device to rely on, and can safely fail over. In short, it can be considered an implementation of RAID1 mirroring using a combination of a local disk and one on a remote node, but with better integration with cluster software such as heartbeat, and with efficient resynchronization through the ability to exchange dirty bitmaps and data generation identifiers. DRBD currently works only on 2-node clusters, though you could use a hybrid version to expand this limit. When both nodes of the cluster are up, writes are replicated and sent to both the local disk and the other node. For efficiency reasons, reads are fetched from the local disk.

The level of data coupling used depends on the protocol chosen:

Protocol A: Writes are considered to be complete as soon as the local disk write has completed and the data packet has been placed in the send queue for the peer. In case of a node failure, data loss may occur because the data to be written to the remote node's disk may still be in the send queue. The data on the failover node remains consistent, however, just not up to date. This protocol is usually used for geographically separated nodes.

Protocol B: Writes on the primary node are considered to be complete as soon as the local disk write has completed and the replication packet has reached the peer node. Data loss may occur in case of simultaneous failure of both participating nodes, because the in-flight data may not have been committed to disk.

Protocol C: Writes are considered complete only after both the local and the remote node's disks have confirmed the writes are complete. There is no data loss, so this is a popular scheme for clustered nodes, but the I/O throughput is dependent on the network bandwidth.
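The difference between the three protocols comes down to which events must have happened before a write is reported as complete. A minimal sketch of that rule follows; the enum, the structure, and the function name are all hypothetical and are not taken from the DRBD source.

```c
#include <assert.h>
#include <stdbool.h>

enum drbd_protocol { PROTO_A, PROTO_B, PROTO_C };   /* hypothetical names */

/* Progress of a single replicated write, as seen by the primary. */
struct write_progress {
    bool local_disk_done;    /* local disk write completed */
    bool queued_for_peer;    /* packet placed in the send queue */
    bool peer_received;      /* packet has reached the peer node */
    bool peer_disk_done;     /* peer confirmed its disk write */
};

/* May the write be reported as complete under the given protocol? */
static bool write_is_complete(enum drbd_protocol proto,
                              const struct write_progress *w)
{
    switch (proto) {
    case PROTO_A:
        return w->local_disk_done && w->queued_for_peer;
    case PROTO_B:
        return w->local_disk_done && w->peer_received;
    case PROTO_C:
        return w->local_disk_done && w->peer_disk_done;
    }
    assert(!"unknown protocol");
    return false;
}
```

Each protocol waits for a later event than the one before it, which is why latency grows and the window for data loss shrinks when moving from A to C.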

DRBD classifies the cluster nodes as either "primary" or "secondary." Primary nodes can initiate modifications or writes whereas secondary nodes cannot. This means that a secondary DRBD node does not provide any access and cannot be mounted. Even read-only access is disallowed for cache coherency reasons. The secondary node is present mainly to act as the failover device in case of an error. The secondary node may become primary depending on the network configuration. Role assignment and designation is performed by the cluster management software.

There are different ways in which a node may be designated as primary:

Single Primary: The primary designation is given to one cluster member. Since only one cluster member manipulates the data, this mode is useful with conventional filesystems such as ext3 or XFS.

Dual Primary: Both cluster nodes can be primary and are allowed to modify the data. This is typically used with cluster-aware filesystems such as OCFS2. The current DRBD release supports a maximum of two primary nodes in a basic cluster.

Worker Threads

A part of the communication between nodes is handled by threads to avoid deadlocks and complex design issues. The threads used for communication are:

drbd_receiver: Handles incoming packets. On the secondary node, it allocates buffers, receives data blocks and issues write requests to the local disk. If it receives a write barrier, it sleeps until all pending write requests have been finished.

drbd_sender: Sender thread for data blocks in response to a read request. This is done in a thread other than drbd_receiver, to avoid distributed deadlocks. If a resynchronization process is running, its packets are generated by this thread.

drbd_asender: Acknowledgment sender. Hard drive drivers are informed of request completions through interrupts. However, sending data over the network in an interrupt callback routine may block the handler. So, the interrupt handler places the packet in a queue which is picked up by this thread and sent over the network.

Failures

DRBD requires a small reserve area for metadata, to handle post-failure operations (such as synchronization) efficiently. This area can be configured either on a separate device (external metadata), or within the DRBD block device (internal metadata). It holds the metadata with respect to the disk, including the activity log and the dirty bitmap (described below).

Node Failures

If a secondary node dies, it does not affect the system as a whole because writes are not initiated by the secondary node. If the failed node is primary, data which has been submitted for writing, but for which no completion has been received, may be lost. To avoid this, DRBD maintains an "activity log," a reserved area on the local disk which contains information about write operations which have not completed. The data is stored in extents and is maintained in a least recently used (LRU) list. Each change of the activity log causes a metadata update (a single sector write). The size of the activity log is configured by the user; it is a tradeoff between minimizing updates to the metadata and the resynchronization time after the crash of a primary node.
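The size tradeoff is easier to see with a toy model of the activity log: a small LRU set of "hot" extents in which only a change to the set itself costs a metadata write. The sketch below is written for this illustration and is not DRBD's data structure; activity_log, al_touch(), and AL_MAX_SLOTS are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define AL_MAX_SLOTS 8   /* compile-time cap for this sketch only */

/* Hypothetical activity log: hot extents kept in LRU order,
 * slots[0] being the most recently used. */
struct activity_log {
    unsigned long slots[AL_MAX_SLOTS];
    size_t used;                 /* extents currently in the log */
    size_t capacity;             /* user-configured size, <= AL_MAX_SLOTS */
    unsigned long meta_writes;   /* metadata updates caused so far */
};

/* Record a write to `extent`. Returns 1 if an extent had to be evicted
 * (its number is stored in *evicted), 0 otherwise. */
static int al_touch(struct activity_log *al, unsigned long extent,
                    unsigned long *evicted)
{
    size_t i, pos = al->used;
    int did_evict = 0;

    assert(al->capacity > 0 && al->capacity <= AL_MAX_SLOTS);

    for (i = 0; i < al->used; i++) {
        if (al->slots[i] == extent) {
            pos = i;
            break;
        }
    }

    if (pos == al->used) {              /* miss: the log itself changes */
        if (al->used == al->capacity) {
            *evicted = al->slots[al->used - 1];   /* least recently used */
            did_evict = 1;
            pos = al->used - 1;
        } else {
            al->used++;
        }
        al->meta_writes++;              /* one metadata update per change */
    }

    /* Shift more recent entries down and put this extent at the front. */
    for (i = pos; i > 0; i--)
        al->slots[i] = al->slots[i - 1];
    al->slots[0] = extent;
    return did_evict;
}
```

Repeated writes to an extent already in the log cost no metadata update in this model, so a larger capacity means fewer updates; the price is that more extents must be treated as possibly out of date after a crash.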

DRBD maintains a "dirty bitmap" in case it has to run without a peer node or without a local disk. It describes the pages which have been dirtied by the local node. Writes to the on-disk dirty bitmap are minimized by the activity log. Each time an extent is evicted from the activity log, the part of the bitmap associated with it which is no longer covered by the activity log is written to disk. The dirty bitmaps are sent over the network to communicate which pages are dirty should a resynchronization become necessary. Bitmaps are compressed (using run-length encoding) before being sent over the network to reduce network overhead. Since most of the bitmaps are sparse, this proves to be quite effective.
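To show why run-length encoding suits a sparse bitmap, here is a small encoder that turns a bitmap into alternating run lengths. The format (runs alternate between clear and set bits, starting with clear) is chosen for this example only; the article does not describe DRBD's actual wire format, and rle_encode() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Run-length encode `nbits` bits into alternating run lengths, starting
 * with a run of clear bits (which may have length zero). Bit i lives in
 * bits[i / 8], at position i % 8. Returns the number of runs written,
 * or 0 if `max_runs` is too small. Illustrative format only. */
static size_t rle_encode(const unsigned char *bits, size_t nbits,
                         size_t *runs, size_t max_runs)
{
    size_t n = 0, run = 0, i;
    int current = 0;          /* value of the run being counted */

    assert(bits != NULL && runs != NULL);

    for (i = 0; i < nbits; i++) {
        int bit = (bits[i / 8] >> (i % 8)) & 1;

        if (bit != current) {
            if (n == max_runs)
                return 0;
            runs[n++] = run;  /* close the previous run */
            run = 0;
            current = bit;
        }
        run++;
    }
    if (n == max_runs)
        return 0;
    runs[n++] = run;          /* close the final run */
    return n;
}
```

A bitmap that is almost entirely clear collapses to a handful of numbers regardless of its size, which is the property the article relies on.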

DRBD synchronizes data once the crashed node comes back up, or in response to data inconsistencies caused by an interruption in the link. Synchronization is performed in a linear order, by disk offset, in the same disk layout as the consistent node. The rate of synchronization can be configured by the rate parameter in the DRBD configuration file.

Disk Failures

In case of local disk errors, the system may deal with them in one of the following ways, depending on the configuration:

detach: Detach the node from the backing device and continue in diskless mode. In this situation, the device on the peer node becomes the main disk. This is the recommended configuration for high availability.

pass_on: Pass the error to the upper layers on a primary node. The disk error is ignored, but logged, when the node is secondary.

call-local-io-error: Invokes a script. This mode can be used to perform a failover to a "healthy" node, and automatically shift the primary designation to another node.
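The three policies above amount to a small decision table keyed on the configured option and, for pass_on, the node's role. The sketch below encodes that table; the enum and function names are hypothetical, and only the three option strings in the comments come from the text.

```c
#include <assert.h>

/* Hypothetical enum names for the three configuration options. */
enum io_error_policy {
    POLICY_DETACH,               /* "detach" */
    POLICY_PASS_ON,              /* "pass_on" */
    POLICY_CALL_LOCAL_IO_ERROR   /* "call-local-io-error" */
};

enum node_role { ROLE_PRIMARY, ROLE_SECONDARY };

enum io_error_action {
    ACTION_GO_DISKLESS,      /* drop the backing device, rely on the peer */
    ACTION_REPORT_UPWARD,    /* hand the error to the upper layers */
    ACTION_LOG_AND_IGNORE,   /* note the error and carry on */
    ACTION_RUN_SCRIPT        /* invoke the configured script */
};

/* What to do when the local backing device reports an I/O error. */
static enum io_error_action on_local_io_error(enum io_error_policy policy,
                                              enum node_role role)
{
    switch (policy) {
    case POLICY_DETACH:
        return ACTION_GO_DISKLESS;
    case POLICY_PASS_ON:
        return role == ROLE_PRIMARY ? ACTION_REPORT_UPWARD
                                    : ACTION_LOG_AND_IGNORE;
    case POLICY_CALL_LOCAL_IO_ERROR:
        return ACTION_RUN_SCRIPT;
    }
    assert(!"unknown policy");
    return ACTION_LOG_AND_IGNORE;
}
```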

Data Inconsistency issues

In the dual-primary case, both nodes may write to the same disk sector, making the data inconsistent. For writes at different offsets, no synchronization is required. To avoid inconsistency issues, data packets over the network are numbered sequentially to identify the order of writes. However, there are still some corner-case inconsistency problems the system can suffer from:

Simultaneous writes by both nodes: In such a situation, one node's writes are discarded. One of the primary nodes is marked with the "discard-concurrent-writes" flag, which causes it to discard write requests from the other node when it detects simultaneous writes. The node with the discard-concurrent-writes flag set sends a "discard ACK" to the other node, informing it that the write has been discarded. The other node, on detecting the discard ACK, writes the data from the first node to keep the drives consistent.

Local request while remote request in flight: This can happen when the disk latency exceeds the network latency. The local node writes to a given block, sending the write operation to the other node. The remote node then acknowledges the completion of the request and sends a new write of its own to the same block - all before the local write has completed. In this case, the local node keeps the new data write request on hold until the local writes are complete.

Remote request while local request is still pending: This situation comes about if the network reorders packets, causing a remote write to a given block to arrive before the acknowledgment of a previous, locally-generated write. Once again, the receiving node will simply hold the new data until the ACK is received.
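The simultaneous-write case can be reduced to two questions: do the two writes touch the same sectors, and does this node carry the discard-concurrent-writes flag? The following sketch answers both under simplifying assumptions; write_req and both functions are hypothetical, and the hold-until-complete handling of the other two cases is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of one write: a run of consecutive sectors. */
struct write_req {
    unsigned long long sector;   /* first sector written */
    unsigned int nr_sectors;     /* number of sectors, at least 1 */
};

/* Writes at different offsets need no coordination; overlapping ones do. */
static bool writes_conflict(const struct write_req *a,
                            const struct write_req *b)
{
    assert(a->nr_sectors > 0 && b->nr_sectors > 0);
    return a->sector < b->sector + b->nr_sectors &&
           b->sector < a->sector + a->nr_sectors;
}

enum peer_write_action { APPLY_PEER_WRITE, DISCARD_PEER_WRITE };

/* Decide what to do with a peer's write that arrived while a local write
 * is still pending. Only the node carrying the discard-concurrent-writes
 * flag throws the peer's data away (and would then send a discard ACK). */
static enum peer_write_action
on_concurrent_peer_write(const struct write_req *local,
                         const struct write_req *peer,
                         bool discard_concurrent_writes)
{
    if (!writes_conflict(local, peer))
        return APPLY_PEER_WRITE;
    return discard_concurrent_writes ? DISCARD_PEER_WRITE
                                     : APPLY_PEER_WRITE;
}
```

Because exactly one node carries the flag, both nodes end up with the flagged node's data for the contested sectors, which keeps the two disks identical.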

Conclusion

DRBD is not the only distributed storage implementation under development. The implementation of Distributed Storage (DST) contributed by Evgeniy Polyakov and accepted into the staging tree takes a different approach. DRBD is limited to 2-node active clusters, while DST can have larger numbers of nodes. DST works on a client-server model, where the storage is at the server end, whereas DRBD is peer-to-peer based and designed for high availability rather than for distributing storage. DST, on the other hand, is designed for accumulative storage, with storage nodes which can be added as needed. DST has a pluggable module which accepts different algorithms for mapping the storage nodes into a cumulative storage. The algorithm chosen can be mirroring, which would provide the same basic capability of replicated storage as DRBD.

DRBD code is maintained in the git repository at git://git.drbd.org/linux-2.6-drbd.git, under the "drbd" branch. It incorporates the minor review comments posted on LKML after Philipp Reisner released the patch set. For further information, see the several PDF documents mentioned in the DRBD patch posting.


Patches and updates

Kernel trees

Linus Torvalds Linus 2.6.30-rc3

Thomas Gleixner 2.6.29.1-rt8

Willy Tarreau Linux 2.4.37.1

Architecture-specific

Fenghua Yu Intel IOMMU Pass Through Support

Core kernel code

Arun R Bharadwaj timers: Framework for migration of timers

Paul E. McKenney v3 RCU implementation with fast grace periods

Development tools

Dan Carpenter smatch 1.52 released

Larry Woodman mm tracepoints update

Device drivers

Jeff Garzik AHCI updates: Marvell AHCI PATA works; pata_marvell fate?

Peter Holik [Resend][PATCH] usb driver for intellon int51x1 based PLC like devolo dlan duo

Greg KH driver core patches for 2.6.31-rc2

Atsushi Nemoto DMA: TXx9 Soc DMA Controller driver (v3)

Documentation

Michael Kerrisk man-pages-3.21 is released

Tilman Schmidt Documentation/isdn/INTERFACE.CAPI

Paul Gortmaker documentation: list common guidelines for commit log content

Filesystems and block I/O

Ryo Tsuruta bio-cgroup: Introduction

Andrea Righi cgroup: io-throttle controller (v14)

Memory management

Izik Eidus ksm - dynamic page sharing driver for linux v4

Mel Gorman Cleanup and optimise the page allocator V6

Networking

Stephen Hemminger netfilter: use per-cpu reader-writer lock (v0.7)

Jiri Pirko [PATCH 1/3] net: introduce a list of device addresses dev_addr_list (v2)

Virtualization and containers

Gregory Haskins virtual-bus

Benchmarks and bugs

Rafael J. Wysocki 2.6.30-rc2-git2: Reported regressions from 2.6.29

Miscellaneous

Karel Zak util-linux-ng v2.15-rc2

David VomLehn Wait for console to become available, ver 3

Page editor: Jonathan Corbet