
Infrastructure unification in the block layer

October 7, 2009

This article was contributed by Neil Brown

For many years, Linux has had two separate subsystems for managing indirect block devices: virtual storage devices which combine storage from one or more other devices in various ways to provide either improved performance, flexibility, capacity, or redundancy. These two are DM (which stands for Device Mapper) and MD (which might stand for Multiple Devices or Meta Disk and is only by pure coincidence the reverse of DM).

For nearly as long, there have been suggestions that having two frameworks is a waste and that they should be unified. However, little visible effort has been made toward this unification, and such efforts as there might have been have not yielded any lasting success. The most united thing about the two is that they have a common directory in the Linux kernel source tree (drivers/md); this is more a confusion than a unification. The two subsystems have both seen ongoing development side by side, each occasionally gaining functionality that the other has and so, in some ways, becoming similar. But similarity is not unity; rather it serves to highlight the lack of unity, as it is no longer function that keeps the two separate, only form.

Exploring why unification has never happened would be an interesting historical exercise that would need to touch on the personalities of the people involved, the drift in functionality between the two systems which started out with quite different goals, the differing perceptions of each by various members of the community, and the technological differences that would need to be resolved. Not being an historian, your author only feels competent to comment on that last point, and, as it is the one where a greater understanding is most likely to aid unification, this article will endeavor to expose the significant technological issues that keep the two separate. In particular, we will explore the weaknesses in each infrastructure. Where a system has strengths, they are likely to be copied, thus creating more uniformity. Where it has weaknesses, they are likely to be assiduously avoided by others, thus creating barriers.

Trying to give as complete a picture as possible, we will explore more than just DM and MD. Loop, NBD, and DRBD provide similar functionality behind their own single-use infrastructure; exploring them will ensure that we don't miss any important problems or needs.

The flaws of MD

Being the lead developer of MD for some years, your author feels honour bound to start by identifying weaknesses in that system.

One of the more ugly aspects of MD is the creation of a new array device. This is triggered by simply opening a device special file, typically in /dev. In the old days, when we had a fairly static /dev directory, this seemed a reasonable approach. It was simply necessary to create a bunch of entries in /dev (md0, md1, md2, ...) with appropriate major and minor numbers at the same time that the other static content of /dev was created. Then, whenever one of those entries was opened, the internal data structures would spontaneously be created so that the details of the device could be filled in.

However with the more modern concept of a dynamic /dev, reflecting the fact that the set of devices attached to a given system is quite fluid, this doesn't fit very well. udev, which typically manages /dev, only creates entries for devices that the kernel knows about. So it will not create any md devices until the kernel believes them to exist. They won't exist until the device file has been created and can be opened - a classic catch-22 situation.

mdadm, the main management tool for MD, works around this problem by creating a temporary device special file just so that it can open it and thereby create the device. This works well enough, but is, nonetheless, quite ugly. The internal implementation is particularly ugly and it was only relatively recently that the races inherent in destroying an MD device (which could be recreated at any moment by user space) were closed so that MD devices don't have to exist forever.
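
As a rough illustration of that workaround (the path, the choice of minor 0, and the lack of real error handling are all simplified here; mdadm's actual code differs in detail), the trick amounts to creating a transient node for MD's major number and opening it:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <unistd.h>

int main(void)
{
    const char *node = "/tmp/.md0-create";   /* hypothetical temporary path */

    /* Create a block-device node for major 9 (MD), minor 0 ... */
    if (mknod(node, S_IFBLK | 0600, makedev(9, 0)) < 0) {
        perror("mknod");
        return 1;
    }
    /* ... and open it: this is what makes the kernel instantiate md0. */
    int fd = open(node, O_RDONLY);
    if (fd >= 0)
        close(fd);
    unlink(node);   /* the node was only needed to trigger device creation */
    return 0;
}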

A closely related problem with MD is that the block device representing the array appears before the array is configured or has any data in it. So when udev first creates /dev/md0, an attempt to open and read from it (to find out if a filesystem is stored there, for example) will find no content. It is only after the component devices have been attached to the array and it has been fully configured that there is any point in trying to read data from the array.

This initial state, where the device exists but is empty, is somewhat like the case of removable-media devices, and can be managed along the same lines as those: we could treat the array as media that can spontaneously appear. However MD is, in other ways, quite unlike removable media (there is no concept of "eject") and it would generally cause less confusion if MD devices appeared fully configured so they looked more like regular disk drive devices.

A problem that has only recently been addressed is the fact that MD manages the metadata for the arrays internally. The kernel module knows all about the layout of data in the superblock and updates it as appropriate. This makes it easy to implement, but not so easy to extend. Due to the lack of any real standard, there are many vendor-specific metadata layouts that all can be used to describe the same sort of array. Supporting all of those in the kernel would unnecessarily bloat the kernel, and supporting them in user space requires information about required updates to be reported to user space in a reliable way.

As mentioned, this problem has recently been addressed, so it is now quite possible to manage vendor-specific metadata from user space. It is still worth noting, though, as one of the problems that has stood in the way of earlier attempts at DM/MD integration: DM does not manage metadata at all, leaving it up to user-space tools.

The final flaw in MD to be exposed here is the use and nature of the ioctl() commands that are used to configure and manage MD arrays. The use of ioctl() has been frowned upon in the Linux community for some years. There are a number of reasons for this. One is that strace cannot decode newly-defined ioctls, so the use of ioctl() can make a program's behaviour harder to observe. Another is that it is a binary interface (typically passing C structures around) and so, when Linux is configured to support multiple ABIs (e.g. a 32bit and a 64bit version), there is often a need to translate the binary structure from one ABI to the other (see the nearly 3000 lines in fs/compat_ioctl.c).

In the case of MD, the ioctl() interface is not very extensible. The command for configuring an array allows only a "level", a "layout", and a "chunksize" to be specified. This works well enough for RAID0, RAID1, and RAID5, but even with RAID10 we needed to encode multiple values into the "layout" field which, while effective, isn't elegant.
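
A minimal sketch of what such a configuration call looks like, assuming the mdu_array_info_t layout and ioctl names from <linux/raid/md_u.h>; only a few fields are shown, and a real caller would go on to add component devices and start the array:

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/raid/md_u.h>

int configure_raid5(const char *mddev)
{
    mdu_array_info_t info = {
        .major_version = 0,
        .minor_version = 90,
        .level         = 5,             /* RAID5 */
        .layout        = 2,             /* left-symmetric parity layout */
        .chunk_size    = 64 * 1024,     /* bytes */
        .raid_disks    = 3,
    };
    int fd = open(mddev, O_RDWR);
    if (fd < 0)
        return -1;
    /* Everything this interface can express is packed into the one struct. */
    int ret = ioctl(fd, SET_ARRAY_INFO, &info);
    /* ADD_NEW_DISK and RUN_ARRAY ioctls would normally follow. */
    close(fd);
    return ret;
}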

In the last few years MD has grown a separate configuration interface via a collection of attribute files exposed in sysfs. This is much more extensible, and there are a growing number of features of MD which require the sysfs interface. However even here there is still room for improvement. The MD attribute files are stored in a subdirectory of the block device directory (e.g. /sys/block/md0/md/). While this seems natural, it entrenches the above-mentioned problem that the block device must exist before the array can be configured. If we wanted to delay creation of the block device until the array is ready to serve data, we would need to store these attribute files elsewhere in sysfs.
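
For comparison, here is a sketch of the same sort of configuration through the sysfs attributes just mentioned; the attribute names (level, chunk_size, raid_disks) are those found under /sys/block/mdX/md/, and error handling is trimmed:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int md_attr_write(const char *attr, const char *val)
{
    char path[128];

    snprintf(path, sizeof(path), "/sys/block/md0/md/%s", attr);
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, val, strlen(val));
    close(fd);
    return n < 0 ? -1 : 0;
}

/* Each attribute is a separate, human-readable write:
 *     md_attr_write("level", "raid5");
 *     md_attr_write("chunk_size", "65536");
 *     md_attr_write("raid_disks", "3");
 */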

The failings of DM

DM has a very different heritage than MD and, while it shares some of the flaws of MD, it avoids others.

DM devices do not need to exist in /dev before they can be created. Rather there is a dedicated "character device" which accepts DM ioctl() commands, including the command to create a new device. Thus, the catch-22 problem from which MD suffers is not present in DM. It has been suggested that MD should take this approach too. However, while it does solve one problem, it still leaves the problem of using ioctl(). There doesn't seem to be much point in making a significant change to a subsystem unless the result avoids all known problems. So while waiting for a perfect solution, no such small steps have been made to bring MD and DM closer together.

The related issue of a block device existing before it is configured is still present in DM, though separating the creation of the DM device from the creation of the block device would be much easier in DM. This is because, as mentioned, with DM all configuration happens over the character device whereas with MD, the configuration happens via the block device itself, so it must exist before it can be configured.

While DM also uses ioctl() commands, which could be seen as a weakness, the commands chosen are much more extensible than those used by MD. The ioctl() command to configure a device essentially involves passing a text string to the relevant module within DM, and it interprets this string in any way it likes. So DM is not limited to the fields that were thought to be relevant when DM was first designed.
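
In practice the text string is usually passed through libdevmapper rather than by issuing the ioctl() directly (see also the libdevmapper comment near the end of this page). A sketch, with a made-up device name and size:

#include <libdevmapper.h>

int create_linear_device(void)
{
    struct dm_task *dmt = dm_task_create(DM_DEVICE_CREATE);
    if (!dmt)
        return -1;
    dm_task_set_name(dmt, "example");
    /* start, length (512-byte sectors), target type, target-specific string;
     * the last argument is the free-form text interpreted by the target. */
    dm_task_add_target(dmt, 0, 2048, "linear", "/dev/sda 0");
    int ok = dm_task_run(dmt);
    dm_task_destroy(dmt);
    return ok ? 0 : -1;
}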

Metadata management with DM is very different than with MD. In the original design, there was never any need for the kernel module to modify metadata, so metadata management was left entirely in user space where it belongs. More recently, with RAID1 and RAID5 (which is still under development), the kernel is required to synchronously update the metadata to record a device failure. This requires a degree of interaction between the kernel and user space which has had to be added.

The main problem with the design of DM is the fact that it has two layers: the table layer and the target layer. This undoubtedly comes from the original focus of DM, which was logical volume management (LVM), and it fits that focus quite well. However, it is an unnecessary layering and just gets in the way of non-LVM applications.

A "target" is a concept internal to DM, which is the abstraction that each different module presents. So striping, raid1, multipath, etc. each present a target, and these targets can be combined via a table into a block device.

A "table" is simply a list of targets each with a size and an offset. This is analogous to the "linear" module in MD or what is elsewhere described as concatenation. The targets are essentially joined end-to-end to form a single larger block device.

This contrasts with MD where each module - raid0, raid1, or multipath, for example - presents a block device. This block device can be used as-is, or it can be combined with others, via a separate array, into a single larger block device.

To highlight the effect of this layering a little more, suppose we were to have two different arrays made of a few devices. In one array we want the data striped across the devices. In the other we lay the data out filling the first device first and then moving on to the next device. With MD, the only difference between these two would be the choice of "raid0" or "linear" as the module to manage them. With DM, the first step would involve including all the devices in a single "stripe" target, and then placing that target as the sole entry in a table. The second would involve creating a number of "linear" targets, one for each device, and then combining them into a table with multiple entries.
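
To make the difference concrete, here is roughly what the two tables would look like in DM's usual "start length target arguments" text form; device names and sector counts are invented, and the striped target takes a stripe count and chunk size followed by device/offset pairs. The striped case is a single table entry:

0 6291456 striped 3 256 /dev/sda 0 /dev/sdb 0 /dev/sdc 0

while the concatenation case is a table of three linear entries:

0 2097152 linear /dev/sda 0
2097152 2097152 linear /dev/sdb 0
4194304 2097152 linear /dev/sdc 0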

Having this internal abstraction of a "target" serves to insulate and isolate DM from the block device layer, which is the common abstraction used by other virtual devices. A good example of this separation is the online reconfiguration functionality that DM provides. The boundary between the table and the targets allows DM to capture new requests in the table layer while allowing the target layer to drain and become idle, and then to potentially replace all the targets with different targets before releasing the requests that have been held back.

Without that internal "target" layer, that functionality would need to be implemented in the block layer on its boundary with the driver code (i.e. in generic_make_request() and bio_endio()). Doing this would be more effort (i.e. DM would not benefit from insulation) and it would then be more generally useful (i.e. DM would not be so isolated). Many people have wanted to be able to convert a normal device into a degraded RAID1 "array" or to enable multipath support on a device without first unmounting the filesystem which was mounted directly from one of the paths. If online reconfiguration were supported at the block layer level, these changes would become possible.

The difference of DRBD

DRBD, the Distributed Replicated Block Device, is the most complex of the virtual block devices that do not aim to provide a general framework. It is not yet included in the mainline, but it could yet be merged in 2.6.33.

Its configuration mechanism is similar to that of DM in a number of ways. There is a single channel which can be used to create and then manage the block devices. The protocol used over this channel is designed to be extensible, though the current definitions are very much focused around the particular needs of DRBD (as would be expected), so how easy it might be to extend to a different sort of array is not immediately clear.

Where DM uses ioctl() with string commands over a dedicated character device, DRBD uses a packed binary protocol over a netlink connection. This is essentially a socket connection between the kernel module and a user-space management program which carries binary encoded messages back and forth. This is probably no better or worse than ioctl(); it is simply different. Presumably it was chosen because there is general bad feeling about ioctl(), but no such bad feeling about netlink. Linus, however, doesn't seem keen on either approach.

DRBD appears to share metadata management between the kernel module and user space. Metadata which describes a particular DRBD configuration is created and interpreted by user-space tools and the information that is needed by the kernel is communicated over the netlink socket. DRBD uses other metadata to describe the current replication state of the system - which blocks are known to be safely replicated and which are possibly inconsistent between the replicas for some reason. This metadata (an activity log and a bitmap) is managed by the kernel, presumably for performance reasons.

This sharing of responsibility makes a lot of sense as it allows the performance-sensitive portions to remain in the kernel but still leaves a lot of flexibility to support different metadata formats. This approach could be improved even more by making the bitmap and activity log into independent modules that can be used by other virtual devices. Each of DM, MD, and DRBD has very similar mechanisms for tracking inconsistencies between component devices; this is possibly the most obvious area where sharing would be beneficial.

Loop, NBD and the purpose of infrastructure

Partly to emphasize the fact that it isn't necessary to use a framework to have a virtual block device, loop and NBD (the Network Block Device) are worth considering. While loop doesn't appear to aim to provide a framework for a multiplicity of virtual devices, it nonetheless combines three different functions into one device. It can make a regular file look like a block device, it can provide primitive partitioning of a different block device, and it can provide encryption and decryption so that an encrypted device can be accessed. Significantly, these are each functions that were subsequently added to DM, thus highlighting the isolating effect of the design of DM.

NBD is much simpler in that it has just one function: it provides a block device for which all I/O requests are forwarded over a network connection to be serviced on - normally - a different host. It is possibly most instructive as an example of a virtual block device that doesn't need any surrounding framework or infrastructure.

Two areas where DM or MD devices make use of an infrastructure, while Loop and NBD need to fend for themselves, are in the creation of new devices and the configuration of those devices. NBD takes a very simple approach of creating a predefined number of devices at module initialization time and not allowing any more. Loop is a little more flexible and uses the same mechanism as MD, largely provided by the block layer, to create loop devices when the block device special file is opened. It does not allow these to be deleted until the module is unloaded, usually at system shutdown time. This architecture suggests that some infrastructure could be helpful for these drivers, and that the best place for that infrastructure could well be in the block layer, and thus shared by all devices.

For configuration, both Loop and NBD use a fairly ad hoc collection of ioctl() commands. As we have already observed, this is both common and problematic. They could both benefit from a more standardized and transparent configuration mechanism.
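
The loop driver gives a flavour of that ad hoc style: binding a backing file to a loop device is a single binary ioctl(), LOOP_SET_FD from <linux/loop.h>. A sketch, with invented paths and trimmed error handling:

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/loop.h>

int bind_loop(const char *backing_file)
{
    int loopfd = open("/dev/loop0", O_RDWR);
    int filefd = open(backing_file, O_RDWR);
    if (loopfd < 0 || filefd < 0)
        return -1;
    /* The whole configuration step is one driver-specific binary command. */
    int ret = ioctl(loopfd, LOOP_SET_FD, filefd);
    close(filefd);
    close(loopfd);
    return ret;
}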

It might be appropriate to ask at this point why there is any need for subsystem infrastructure such as DM and MD. Why not simply follow the pattern seen in loop, NBD, and DRBD and have a separate block device driver for each sort of virtual block device? The most obvious reason is one that doesn't really apply any more. At the time when MD and DM were being written there was a strong connection between major device numbers and block device drivers. Each driver needed a separate major number. Loop is 7, NBD is 43, DRBD is 147, and MD is 9. DM doesn't have a permanently allocated number; it chooses a spare number when the module is loaded, so it usually gets 254 or 253.

Furthermore, at that time, the number of available major numbers was limited to 255 and there was danger of running out. Allocating one major number for RAID0, one for LINEAR, one for RAID1 and so forth would have looked like a bit of a waste, so getting one for MD and plugging different personalities into the one driver might have been a simple matter of numerical economy. Today, we have many more major numbers available, and we no longer have a tight binding between major numbers and device drivers - a driver simply claims whichever device numbers it wants at any time, when the module is loaded, or when a device is created.

A second reason is the fact that all the MD personalities envisioned at the time had a lot in common. In particular they each used a number of component devices to create a larger device. While creating a midlayer to encapsulate this functionality might be a mistake, it is a very tempting step and would seem to make implementation easier.

Finally, as has been mentioned, having a single module which defines its own internal interfaces can provide a measure of insulation from other parts of the kernel. While this was mentioned only in the context of DM, it is by no means absent from MD. That insulation, while not necessarily in the best interests of the kernel as a whole, can make life a lot easier for the individual developer.

None of these reasons really stand up as defensible today, though some were certainly valid in the past. So it could be that, rather than seeking unification of MD and DM, we should be seeking their deprecation. If we can find a simple approach to allow different implementations of virtual block devices to exist as independent drivers, but still maintain all the same functionality as they presently have, that is likely to be the best way forward.

Unification with the device model

This brings us to the Linux device model. While there may be no real need to unify DM with MD, the devices they create need to fit into the unifying model for devices which we call the "device model" and which is exposed most obviously through various directory trees in sysfs. The device model has a very broad concept of a "device." It is much more than the traditional Unix block and character devices; it includes busses, intermediate devices, and just about anything that is in any way addressable.

In this model it would seem sensible for there to be an "array" device which is quite separate from the "block" device providing access to the data in the array. This is not unlike the current situation where a SCSI bus has a child which is a SCSI target which, in turn, has a child which is a SCSI LUN (Logical UNit), and that device itself is still separate from the block device that we tend to think of as a "SCSI disk". This separation would allow the array to be created and configured before the block device can come into being, thus removing any room for confusion for udev.

The device model already allows for a bus driver to discover devices on that bus. In most cases this happens automatically during boot or at hotplug time. However, it is possible to ask a bus to discover any new devices, or to look for a particular new device. This last action could easily be borrowed to manage creation of virtual block devices on a virtual bus. The automatic scan would not find any devices, but an explicit request for an explicitly-named device could always succeed by simply creating that device. If we then configure the device by filling in attribute files in the virtual block device, we have a uniform and extensible mechanism for configuring all virtual block devices that fits with an existing model.
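
A purely hypothetical sketch of what that might look like from user space - none of these sysfs paths exist in any kernel, and the names are invented only to illustrate the proposal:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static void sysfs_put(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);
    if (fd >= 0) {
        write(fd, val, strlen(val));
        close(fd);
    }
}

int main(void)
{
    /* An explicit request for a named device on the (imaginary) virtual
     * block bus: a scan finds nothing, but naming a device creates it. */
    sysfs_put("/sys/bus/vblock/new_device", "r1");

    /* Configure it through attribute files before any block device exists. */
    sysfs_put("/sys/bus/vblock/devices/r1/level", "raid1");
    sysfs_put("/sys/bus/vblock/devices/r1/components", "/dev/sda1 /dev/sdb1");
    return 0;
}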

Again, the device model already allows for binding different drivers to devices as implemented in the different "bind" files in the /sys/bus directory tree. Utilizing this idea, once a virtual block device was "discovered" on the virtual block device bus, an appropriate driver could be bound to it that would interpret the attributes, possibly create files for extra attributes, and, ultimately, instantiate the block device.

Possibly the most difficult existing feature to represent cleanly in the device model is the on-line reconfiguration that DM and, more recently, MD provide. This allows control of an array to be passed from one driver to another without needing to destroy and recreate the block device (thus, for example, a filesystem can remain mounted during the transition). Doing this exchange in a completely general way would involve detaching a block device from one parent and attaching it to another. This would be complex for a number of reasons, one being the backing_dev_info structure, which creates quite a tight connection between a filesystem and the driver for the mounted block device.

Another weakness in the device model is that dependencies between devices are very limited - a device can be dependent on at most one other device, its parent. This doesn't fit very well with the observation that an array is dependent on all the components of the array, and that these components can change from time to time. Fortunately this weakness has already been identified and, hopefully, will be resolved in a way that also works for virtual block devices.

So, while there are plenty of issues that this model leaves unresolved, it does seem that unification with the device model holds the key to unification between MD and DM, along with any other virtual block devices.

So what is the answer?

Knowing that a problem is hard does not excuse us from solving it. With the growing interest in managing multiple devices together, as seen in DRBD and Btrfs, as well as in the increasing convergence in functionality between DM and MD, now might be the ideal time to solve the problems and achieve unification. Reflecting on the various problems and differences discussed above, it would seem that a very important step would be to define and agree on some interfaces; two in particular.

The first interface that we need is the device creation and configuration interface. It needs to provide for all the different needs of DM, MD, Loop, NBD, DRBD, and probably even Btrfs. It needs to be sufficiently complete such that the current ioctl() and netlink interfaces can be implemented entirely through calls into this new interface. It is almost certain that this interface should be exposed through sysfs and so needs to integrate well with the device model.

The second interface is the one between the block layer and individual block drivers. This interface needs to be enhanced to support all the functionality that a DM target expects of its interface with a DM table, and in particular it needs to be able to support hotplugging of the underlying driver while the block device remains active.

Defining and, very importantly, agreeing on these interfaces will go a long way towards achieving the long sought after unification.


Infrastructure unification in the block layer

Posted Oct 8, 2009 2:53 UTC (Thu) by ncm (guest, #165) [Link] (1 responses)

Maybe I haven't been paying close enough attention... Once this unification is complete (and untold hundreds or thousands of lines of kernel code have been deleted), what will it be possible to do that cannot be done today? I.e., is this solely a worthwhile cleanup, or is it also clearing a logjam to open the way for something dramatic?

Infrastructure unification in the block layer

Posted Oct 8, 2009 6:45 UTC (Thu) by dlang (guest, #313) [Link]

among other things it will eliminate the current problem where you need to use DM if you need some features, but need to use MD if you need some other features. if you need both you are just out of luck

it will also make it much easier to have unified userspace tools. one of the common statements about ZFS is how it makes it so easy to create a filesystem and raid array at the same time. with linux the question would be which raid framework would it support, after this unification it would be able to do everything.

new features (say a checksumming driver) would only need to be implemented once and everything would be able to benefit

Infrastructure unification in the block layer

Posted Oct 8, 2009 4:16 UTC (Thu) by filteredperception (guest, #5692) [Link]

If you want to see what I consider a pretty cool use of swapping targets from a devicemapper device while it is mounted, check out the ZyX Rebootless LiveOS Installer, now available in the fedora-11 updates repository (zyx-liveinstaller).

http://viros.org/rebootless

Short story, it's like the normal Fedora LiveCD/USB installer, except you don't need to reboot after installation completes to start/continue using your now installed/non-live system.

A few words on DRBD and user space - kernel interfaces

Posted Oct 8, 2009 13:41 UTC (Thu) by philipp (guest, #8960) [Link] (8 responses)

A few words on DRBD and user space - kernel interfaces

First of all: Neil, thanks for that excellent article.

As the section on DRBD is considerably shorter than the sections
on DM/MD, I can add a bit for DRBD here.

DRBD used to have an ioctl()-based interface (drbd-0.7 releases and
earlier). The main issue we stumbled across was that this ioctl interface
was not designed with later extensibility in mind, i.e. we required our
users to update the user-land programs along with the kernel module. Our
code enforced this: the module refused to talk to older or newer user
space programs. Since at that time we were an out-of-mainline module, this
was never an issue for our users. The users were used to this, and this
was also expressed in the packages' dependencies.

The other issue we had appeared with kernel and user land running with
different word sizes, i.e. 64-bit kernel and 32-bit user land. While this
is not a principal issue with ioctls, we got that wrong in the beginning.

As we realised at that point that ioctls are frowned upon in the kernel
community, and we had the plan to go mainline, I decided that we needed
a new interface.

As genetlink was not yet in the kernel, and bare netlink is not usable
for external modules, we ended up using the connector. (Connector is
just a thin layer on top of netlink.)

So connector seemed to be a good choice, since it is not ioctl() and
it avoids the catch-22 issue mentioned by Neil. And a nice byproduct
is that netlink can also be used to inform user space about random
events in the kernel.

Unfortunately it turned out that connector has its own issues, and
that Linus is not a friend of the whole netlink idea either.

DRBD's connector interface: The good stuff.

Now let me point out how DRBD's netlink interface is extensible.
The netlink packet is not based on a fixed layout (i.e. a C struct);
instead it is a "tag list". Imagine it as a list of labels and values.
Each is typed (the available types are: bit, int32, int64, string/blob)
and each attribute is either mandatory or optional. In the implementation
the labels are numbers, with the convention that such a number
will never be re-used.

On the kernel side, such a packet gets processed only if all the
label numbers of the mandatory tags are known, otherwise the
operation is refused, and the user is informed that the kernel
component is too old for the desired operation.

With that scheme it is possible to use older user space tools to
configure newer DRBD drivers. Using newer user space tools on older
kernels works as well, as long as you do not request any feature
not supported by the older DRBD driver.
That proved to be fairly usable.
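
The scheme can be pictured roughly like this - not DRBD's actual wire
format, just a generic illustration of numbered, typed tags with a
mandatory flag and of the kernel-side acceptance rule:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum tag_type { TAG_BIT, TAG_INT32, TAG_INT64, TAG_BLOB };

struct tag {
    uint16_t label;       /* numeric label, never re-used once assigned */
    uint16_t type;        /* one of enum tag_type */
    bool     mandatory;   /* must be understood, or the packet is refused */
    uint16_t len;         /* length of the value bytes that follow */
};

/* Kernel side: process a packet only if every mandatory label is known
 * to this (possibly older) module; otherwise refuse the operation and
 * tell user space the kernel component is too old. */
static bool packet_acceptable(const struct tag *tags, size_t n,
                              bool (*label_known)(uint16_t label))
{
    for (size_t i = 0; i < n; i++)
        if (tags[i].mandatory && !label_known(tags[i].label))
            return false;
    return true;
}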

How we got it wrong again: call_usermodehelper()

At various events we call out to user space.

To give you an example: when we are about to start a resynchronisation
we call such a user helper, because the user may want to create a
snapshot of the block device on the resync target node just before
the resync begins. Of course there is also a user helper that
gets called after the resync has finished, because the user might want
to drop that snapshot automatically.

The mistake was that our user space tools return an error to the kernel
if the particular user space helper is not known. When we
introduced the new before-resync-target user space hook, we broke
installations that updated to the new DRBD driver but kept the old
user space tools, because when the before-resync-target handler
returns an error to the DRBD driver, it does not do the resync
but abandons the connection to the peer.

We can avoid that in the future by more carefully defining the return
code conventions of further user space helpers.

New kid on the block: Configfs (or other vfs based approaches)

The idea seems intriguing at first. A new virtual file system, each
subsystem has its sub directory there. Let the user create new
sub directories in there, below the subsystem level. The kernel
side populates them with virtual files (=attributes), which in turn
can be tuned from user space.

When I try to envision how to use this interface for DRBD..

* Missing here are transactional semantics. In the ioctl()/netlink world,
you send a request towards your kernel driver, and your user space
tool gets a response back.

In case two instances of that user space tool get invoked, and they
have to modify lots of attributes, they would step on each other's toes.

* Quite frequently it is necessary to change multiple attributes in one
operation.

* Along the same lines is the issue of error reporting. Writing to
an attribute should fail with some errno if an invalid value
gets written. In DRBD's netlink protocol we have about 150
error codes defined; mapping those to errno codes is not possible
in a sane way. It would just deliver another number space over
the errno channel.

Configfs' documentation envisions committable items, which are
currently unimplemented; they would only handle the issue of setting
multiple attributes at creation time of the object, and the error
reporting would have to be done through the errno of the rename system call.

For me the root of the issue is that the interface to the filesystem
was never intended to be a transactional interface.

I see the patchgroups interface presented by the Featherstitch project as
a clean way to add transactional semantics to a filesystem, and that
could also bring sane transactional semantics to the configfs interface.
(See http://lwn.net/Articles/354861/)

So, I am suggesting that to get the configuration interface right, we
bring transactions to the filesystem first. Sounds crazy? Maybe it is.
Maybe having transaction semantics for the filesystem is an
important thing we currently miss in Linux!

-Philipp Reisner

A few words on DRBD and user space - kernel interfaces

Posted Oct 8, 2009 23:36 UTC (Thu) by neilbrown (subscriber, #359) [Link] (5 responses)

Hi Philipp,
thanks for the extra historical background on DRBD. It does serve to highlight that getting interfaces "right" really is hard.

I don't believe configfs is a useful answer for anything. I believe the supposed distinction between it and sysfs is an imaginary distinction. It is a bit like the distinction between DM and MD - superficially different but fundamentally the same.

I think 'transaction semantics' are quite achievable in sysfs - I have them for some aspects of MD. The basic idea is that updating some attributes is not immediately effective, but requires a write to a 'commit' attribute.

E.g. I can change the layout, chunksize, and/or the number of devices in a RAID5 by updating any of those attributes, and then writing "reshape" to the 'sync_action' attribute.
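
A minimal sketch of that pattern, assuming the md attribute names of the time (raid_disks, chunk_size, sync_action) and omitting error handling:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static void md_set(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);
    if (fd >= 0) {
        write(fd, val, strlen(val));
        close(fd);
    }
}

int main(void)
{
    /* Stage the new geometry; nothing takes effect yet. */
    md_set("/sys/block/md0/md/raid_disks", "5");
    md_set("/sys/block/md0/md/chunk_size", "131072");

    /* The write below is the "commit" that actually starts the reshape. */
    md_set("/sys/block/md0/md/sync_action", "reshape");
    return 0;
}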

This does raise the question of what should be seen when you read from one of these non-immediate attributes. One option is to read both values (old and new). Another is to have two attributes "X" and "X_new" - writing the 'commit' command copies all the X_new to X. I currently prefer the former.

Your concern about multiple userspace tools being invoked can, I believe, be answered by a userspace solution, probably involving a lockfile in /var/locks or /var/run or similar.

Getting notifications to userspace through sysfs is quite easy using 'poll'. Userspace then re-reads the attribute and decides what to do based on the current state.

I had noted that DRBD uses a lot of different error codes and wondered about that - is it really necessary?
- some of them translate directly to 'standard' error codes
- some of them seem to be reporting that a request was illegal, in which case correctly working code should never have made the request, and a simple EINVAL will do.
- some seem to differentiate which part of a request was illegal (CSUMS_ALG vs VERIFY_ALG?? - I'm guessing here). With the proposed sysfs interface, you wouldn't need that differentiation because you could tell which part of the request was in error by the attribute that was being written to at the time.

So I'm not convinced there is really a need for a lot of error codes, particularly when the interface allows (and requires) a separate status report for each attribute changed.

Thanks,
NeilBrown

A few words on DRBD and user space - kernel interfaces

Posted Oct 17, 2009 18:39 UTC (Sat) by jageorge (guest, #61413) [Link] (4 responses)

Neil,
Please, not the horror of more non-atomic interfaces to needfully atomic operations just to avoid ioctl(). As someone who has written sysfs scanners which sometimes result in bizarre side effects from abuse of sysfs (including non-atomic setup activities), let's be clear about the problem with ioctl(). #1 It creates an unenforceable binary interface which tends to not work well with enhancements or architecture variations. #2 see #1. (BTW I tend to agree about configfs being another name for the same animal - sysfs). My proposition is using a sysfs handle to accept multiple elements in a single atomic operation. That data can be ascii'fied (or not) and involve name-value pairs or simply field (lf) separated data elements in a known order. Sure it violates the one element per handle rule, but for processing atomic operations all elements _must_ be presented atomically.

I do have one alternative in mind which actually can be considered a fork of your proposal, but the Linux infrastructure to implement it is not yet in place. Having a private (per process) sysfs (or procfs) directory where any sysfs hierarchy can be created and later pushed into place (mv?) under a "magic" subdirectory entry in sysfs under your device. This solution would be atomic, non-public, and follow the recommended sysfs element setting process as far as possible... Ultimately a pretty cool solution to the crappy non-atomic or public aggregation problem and perhaps a good long term solution, but one way or the other your solution should have the atomic interface benefits of ioctl() without the binary limitations on portability.
Regards,
Jonathan

sysfs is dumb

Posted Oct 17, 2009 19:54 UTC (Sat) by quotemstr (subscriber, #45331) [Link] (3 responses)

<rant>

Another reason people don't like ioctl is that it's not generically scriptable: to use an interface exposed by an ioctl, a C program must be written that understands the appropriate structure definitions. Scripts can then only run these wrapper programs, and I suppose people didn't want to undertake the chore of wrapper writing. At first, sysfs seems to solve that problem, but the necessary filesystem structure is so hairy, and the ordering and atomicity requirements are so arcane, that people end up writing wrappers anyway! (Consider lspci and lsof.)

Serious question: how is sysfs better than sysctl? Both give you hierarchically-organized human readable ASCII-based cross-architecture key-value pairs that can be manipulated by scripts, but because sysctl is a single system call, there's at least a possibility of making atomic changes without disgusting hacks or having to implement a full filesystem transaction layer.

I don't see sysfs's filesystem interfaces as much of an advantage. You can grep sysctl output even more easily than you can grep /sys; and speaking of the name /sys: it's a de-facto standard. Mounting it elsewhere isn't particularly useful except in the chroot case, and with a sysctl interface, you wouldn't have to mount anything at all!

Sure, you might be able to eventually do something Plan9-like and mount /sys and /proc over NFS, but the last mention I can find of anyone actually attempting that is from 1998. It doesn't seem terribly useful, and besides, the security implications scare the bejeesus out of me.

Besides: using sysctl is simpler! You don't have to worry about opening files, closing them, and so on. And the BSD people seem to get along fine without a sysfs, after all.

Having a private (per process) sysfs (or procfs) directory where any sysfs hierarchy can be created and later pushed into place (mv?) under a "magic" subdirectory entry in sysfs under your device.
This approach won't be particularly popular with people who like to manipulate sysfs with shell scripts.

sysfs is dumb - that depends

Posted Oct 18, 2009 1:37 UTC (Sun) by jageorge (guest, #61413) [Link] (2 responses)

Sysctl under Linux is just a wrapper around /proc... and I'm not saying that the BSD guys got it wrong, but sysfs IS the strategic direction already taken by Linux. However, there are clearly problems with the status quo especially when it comes to atomic operations. Both of my proposals (multi-element sysfs nodes, and process private staging sysfs directory) are compatible with the evolving direction of Linux system resource management from userspace.

The scriptability issue around my private staging tree proposal is easily addressable by using some sort of token (futex/mutex/semaphore) based approach to opening the staging directory instead of a purely PID based approach. Perhaps I'll try a kernel patch to illustrate what I mean ... if I can drum up some interest.

One way or the other private staging of atomic operations (whether ioctl() or some variation on my proposals) is essential for certain operations, and trying to avoid it _will_ result in race conditions many of which have security implications as well... now that I think about it token based private directories would be cool from a temporary directory perspective as well especially if the OS automatically reaped the result after the last token holder exited... so many cool implications... :-)

sysfs is dumb - that depends

Posted Oct 18, 2009 20:03 UTC (Sun) by quotemstr (subscriber, #45331) [Link] (1 responses)

sysfs IS the strategic direction already taken by Linux
It does seem that we're stuck with it for now, though it could be deprecated as many other interfaces have been.

So I agree, there's a need for atomic operations on sysfs. Your ideas seem over-engineered to me though. What's wrong with the following scheme? An application would create a temporary directory anywhere it liked. Under this temporary directory, an application would create a sysfs tree corresponding to the nodes to change, and after that, would write the name of the temporary directory to a new special file, /sys/commit. If the commit is successful, the kernel would remove the temporary directory; if there's an error, it would leave the directory in place and return an error from write, or leave an error file in the temporary directory describing what went wrong.

This scheme doesn't require any new system calls or VFS infrastructure, and it's shell-script compatible.

sysfs is dumb - that depends

Posted Oct 19, 2009 14:40 UTC (Mon) by jageorge (guest, #61413) [Link]

Your suggestion is essentially where I started, but there appear to be a couple of potential issues. 1. The commit from physical file system to sysfs seemed as if it could be expensive and/or racy. 2. Anything that exists in the normal file system environment is potentially vulnerable from a security/race (even multiple instances of the same monitoring/management software) standpoint.

Nevertheless, I don't want to over-complicate the implementation, and it is possible that there are already security facilities in the kernel which could serve to isolate something as process private. Furthermore, I agree that shell scripting should be relatively simple with any solution to this problem... to some extent that's one of the key ideas behind sysfs. An obvious first step would be to stage something without resolving the private view security question... perhaps even something like staging from a normal physical file system and using mv to flatten the directory structure into a text file which would be fed into a writable sysfs inode.

Basically the problem space is pretty clear (non-trivial atomic operations on IO devices) as is the high level of how to address it (sysfs nodes in the correct context which manage security and race problems). Once someone (possibly me) creates an implementation I expect many of the details to fall into place pretty quickly... and then it's just a matter of getting it past Greg and Al (shudder). The sad thing is that after 6 years of sysfs/udev as a "production" solution no one has done anything other than ducking the problem.

A few words on DRBD and user space - kernel interfaces

Posted Oct 20, 2009 19:18 UTC (Tue) by valyala (guest, #41196) [Link] (1 responses)

Why not use protocol buffers ( http://code.google.com/apis/protocolbuffers/docs/overview... ) for all complex extensible APIs between kernel and userspace? .proto files could be shipped together with kernel headers, so userspace programs could uniformly use them for talking to the kernel. strace-like programs could dynamically decode proto messages using the corresponding .proto files.

Here is a list of protobuf features.
- protobuf messages are designed to be extensible and backwards-compatible;
- protobuf encoding is architecture-independent;
- protobuf encoding is space-efficient;
- protobuf encoding is quite simple ( http://code.google.com/apis/protocolbuffers/docs/overview... ), so it is easy to write cpu-efficient codecs with small footprint in arbitrary language;
- protobuf messages can contain other protobuf messages;
- .proto files can include other .proto files;
- encoded protobuf messages can be easily stored to files (space-efficient binary logs), which then can be easily decoded into human-readable text by universal decoders using corresponding .proto definitions;
- according to the http://code.google.com/p/protobuf/ , "Google uses Protocol Buffers for almost all of its internal RPC protocols and file formats" ;)

A few words on DRBD and user space - kernel interfaces

Posted Oct 24, 2009 16:59 UTC (Sat) by jengelh (subscriber, #33263) [Link]

I looked at protobufs about a year ago, and it seems like libnl is doing almost the same (minus (un)serialization).

Infrastructure unification in the block layer

Posted Oct 9, 2009 15:36 UTC (Fri) by giraffedata (guest, #1954) [Link] (2 responses)

Switching drivers while the device image continues is a requirement I've never heard of. Is the point to be able to upgrade the kernel while the system runs? Is there precedent for this kind of thing?

Infrastructure unification in the block layer

Posted Oct 11, 2009 10:06 UTC (Sun) by neilbrown (subscriber, #359) [Link] (1 responses)

The original purpose for being able to switch the driver while the device remains comes from LVM, and in particular 'pvmove' which allows you to move data from one physical location to another without interrupting service.

If at one point in time, the data is being served by one device, and then later it is served by another device, then we need to be able to replug drivers at some level in the stack.

Once people realised that could be done, they quickly saw other possibilities. The one I hear the most is the idea of changing a plain disk to a RAID1 pair without unmounting. There are two difficulties with this: finding somewhere to store the metadata that the array needs, and making the change while the data is 'live'. The former is probably solvable (not in general, but in practice in many actual situations - fdisk often leaves some blank space on the device that it partitions). The latter needs the ability to switch drivers while the device is live.

So the higher up the stack the functionality can be incorporated, the more generally useful it can be.

Switching device drivers without rebooting

Posted Oct 11, 2009 21:35 UTC (Sun) by rwmj (subscriber, #5474) [Link]

This is wanted (but not implemented AFAIK) for virtualization. The idea is that the guest is running, say, an emulated IDE or SCSI device, and then you install virtio drivers which somehow transparently take over the existing device.

Such a feature is common already in commercial hypervisors (eg. VMWare tools does it).

Rich.

ioctl replacement

Posted Oct 10, 2009 17:33 UTC (Sat) by jeremiah (subscriber, #1221) [Link] (1 responses)

It's been 10 years since I wrote anything that used ioctl(). So please excuse the dumb question, but
what is the preferred replacement for it? From what I've gathered reading the article, it seems that
sysfs is the way to go. As long as the driver supports it. Is this a correct assumption?

ioctl replacement

Posted Oct 11, 2009 9:57 UTC (Sun) by neilbrown (subscriber, #359) [Link]

(there are no dumb questions, only dumb answers :-)

I'm not sure that there is any official statement that any one or any group has made that could be deemed authoritative. However I personally think sysfs, and device-model attributes, are the best way to communicate with a device driver. I don't really think there is a credible alternative.

What about LVM?

Posted Oct 15, 2009 7:19 UTC (Thu) by eduperez (guest, #11232) [Link] (1 responses)

What about LVM? How does it fit into this picture? Is it going to be unified, too? Thanks.

What about LVM?

Posted Oct 15, 2009 10:10 UTC (Thu) by mangoo (guest, #32602) [Link]

LVM uses dm (device mapper).

Infrastructure unification in the block layer

Posted Oct 21, 2009 13:03 UTC (Wed) by job (guest, #670) [Link] (2 responses)

Many thanks for an interesting article.

One thing that was not clear to me is what vendors the "vendor specific" metadata of MD refers to. Probably not Linux distribution vendors.

What would be interesting is if someone with knowledge of GEOM could highlight some of the differences and similarities of that scheme. It is the latecomer of the bunch and is now used all over the place in FreeBSD, at least until ZFS came and shook up block device management again.

Infrastructure unification in the block layer

Posted Oct 21, 2009 17:11 UTC (Wed) by dlang (guest, #313) [Link] (1 responses)

I believe (but could be wrong) that vendor specific metadata is referring to raid metadata that commercial raid vendors put on drives. MD is able to understand that data and assemble arrays based on it (and avoid overwriting it)

Infrastructure unification in the block layer

Posted Oct 22, 2009 0:09 UTC (Thu) by nix (subscriber, #2304) [Link]

This is *damn* good because it means that if your proprietary RAID card
dies, md can get the data off the disks again. (It can probably do it
anyway with RAID-5 as long as you remember the stripe size, as there
aren't many ways to arrange a RAID-5 array. But RAID-6 is harder.)

Use libdevmapper with DM, not ioctl()

Posted Apr 27, 2010 4:44 UTC (Tue) by CChittleborough (subscriber, #60775) [Link] (1 responses)

Thanks for an interesting article. I'd like to add one point for people writing userland code: using ioctl() to talk to the device mapper is a Really Bad Idea. Use libdevmapper instead.

The device mapper's ioctl() interface is arcane and ugly, but libdevmapper hides all that, protecting your sanity, saving you effort and giving you source-level compatibility if the userland/DM interface ever changes.

Use libdevmapper with DM, not ioctl()

Posted Apr 27, 2010 9:09 UTC (Tue) by nix (subscriber, #2304) [Link]

... and since the userland/DM interface has already changed once (from a filesystem to an ioctl()) it seems wise to presume that it may change again.


Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds